Posts tagged with: foaf

Semantic Web by Example: Semantic CrunchBase

Update: Wow, these guys are quick, there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served from crunchbase.com directly (maybe using the nicely evolving Rena toolkit), but the SemWeb community has a reputation of scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits, so, at this stage (and given the nice API), it might make more sense to start with a separate site, and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).
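
To make the idea of a mostly mechanical conversion more concrete, here is a minimal Python sketch. It is not the converter used for the site: the namespace URI, the blank-node naming, and the handling of nested objects are all assumptions made for illustration. It only shows how JSON attribute names can be carried over unchanged into a single namespace.

```python
import json

# Hypothetical namespace; the real Semantic CrunchBase vocabulary URI may differ.
CB_NS = "http://example.org/cb/ns#"


def json_to_triples(subject, doc):
    """Turn a JSON object into (subject, predicate, object) triples.

    Attribute names are kept as-is and placed in a single namespace.
    Nested objects become blank nodes; lists repeat the property.
    """
    triples = []
    counter = [0]

    def new_bnode():
        counter[0] += 1
        return "_:b%d" % counter[0]

    def walk(subj, obj):
        for key, value in obj.items():
            pred = CB_NS + key
            values = value if isinstance(value, list) else [value]
            for item in values:
                if item is None:
                    continue  # skip empty attributes
                if isinstance(item, dict):
                    node = new_bnode()
                    triples.append((subj, pred, node))
                    walk(node, item)
                else:
                    triples.append((subj, pred, item))

    walk(subject, doc)
    return triples


# Made-up sample document, loosely shaped like an API response.
doc = json.loads('{"name": "Twitter", "offices": [{"state_code": "CA"}]}')
for triple in json_to_triples("/company/twitter#self", doc):
    print(triple)
```

Because the property names match the JSON keys, someone who knows the JSON API can guess most of the predicates without reading a vocabulary spec.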

An important principle in RDF land is the distinction between a resource and a page about a resource (it's very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for e.g. Twitter and the Twitter description. There are different approaches; I decided to use (fake-)hash URIs, which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company,
  • GETting the identifier resolves to /company/twitter, which describes the company.
  • Direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via Content Negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
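
The mapping between the three kinds of URI in the list above can be sketched in a few lines of Python. The host name below is taken from the SPARQL endpoint mentioned in this post and is only an assumption for the company pages. The function names are made up for this example.

```python
from urllib.parse import urldefrag


def document_uri(resource_uri):
    """Drop the fragment: this is what an HTTP client actually requests."""
    return urldefrag(resource_uri).url


def rdf_uri(resource_uri):
    """Document URI with ".rdf" appended, as described in the list above."""
    return document_uri(resource_uri) + ".rdf"


# Assumed host; only the path pattern comes from the text.
thing = "http://cb.semsol.org/company/twitter#self"
print(document_uri(thing))
print(rdf_uri(thing))
```

The fragment never reaches the server, which is why a hash URI for the company and a plain URI for its description can coexist without any redirect logic.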

The instant benefit of having linked data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors to their organizations to their relations etc.). However, the CrunchBase team has already done a great job: their UI supports this functionality quite nicely, so the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not obvious: An RDF-powered app can be extended at any time. On the data level. Without the need for model changes (because no model is specified). And without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard Jobs, DBpedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as a productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders", or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Right now, those can only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends. Setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information. Here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
LIMIT 20
(Query result as HTML)
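
For readers who want to send such a query from their own code: under the SPARQL Protocol, a query can be passed to an endpoint as the "query" parameter of an HTTP GET. The Python sketch below only builds the request URL. It assumes the endpoint named in this post and assumes that the cb: prefix is predefined on the server, as the examples here suggest. The function name is made up.

```python
from urllib.parse import urlencode

ENDPOINT = "http://cb.semsol.org/sparql"  # endpoint named in this post

# Shortened variant of the acquisition query above.
QUERY = """
SELECT DISTINCT ?name ?year ?month WHERE {
  ?comp cb:exit ?exit ;
        cb:name ?name .
  ?exit cb:acquired_year ?year ;
        cb:acquired_month ?month .
}
ORDER BY DESC(?year) DESC(?month)
LIMIT 5
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL: the query goes in the "query" parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# To actually run it (needs network access, and the endpoint may not be reachable):
# import urllib.request
# print(urllib.request.urlopen(url).read().decode("utf-8"))
```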

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}
(Results)

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far, it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies and investments, and has structured location data. CrunchBoard, on the other hand, has job descriptions, but only a single field for City and State, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS limitations/practices; the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <http://www.crunchboard.com>
ENDPOINT <http://cb.semsol.org/sparql>
# refresh feed
if (${GET.refresh}) {
 # replaced <http://feeds.feedburner.com/CrunchboardJobs> with full feed
 LOAD <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)
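
The join in the script above hinges on one shared value, the company name. Stripped of RDF, the same logic looks like this in Python. The rows are made-up placeholders rather than real CrunchBoard or CrunchBase data, and the function name is hypothetical.

```python
# Hypothetical sample rows standing in for the two graphs.
jobs = [
    {"job_title": "PHP Developer", "comp_name": "ExampleCo"},
    {"job_title": "Designer", "comp_name": "OtherCo"},
]
companies = [
    {"name": "ExampleCo", "state_code": "CA", "round_code": "a"},
    {"name": "OtherCo", "state_code": "NY", "round_code": "a"},
]


def matching_jobs(jobs, companies, state, round_code):
    """Join jobs to companies on the company name, then filter."""
    wanted = {
        c["name"]
        for c in companies
        if c["state_code"] == state and c["round_code"] == round_code
    }
    return [j for j in jobs if j["comp_name"] in wanted]


print(matching_jobs(jobs, companies, "CA", "a"))
```

A name-based join like this is only as good as the spelling on both sides. A job posted under a slightly different company name would silently drop out of the result.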

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls
  • add a template mechanism for human-friendly results

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post. Here are some example API calls that were already created with it:
A lot of fun, more to come.

"Online Social Graph Consolidation" webinale Slides

I gave another talk at webinale2008. This one was about how SemWeb technology (XFN, RDF, FOAF, SPARQL, Inference) can help with the aggregation, integration, and consolidation of online social graph fragments spread across Web 2.0 services. Again, I tried to keep things demo-ish (using grawiki for Linked Data editing, and knowee for the integration and consolidation), so the slides themselves (available on slideshare) aren't too spectacular (and in German).

Got some SemWeb DOAP 'n' FOAF?

All baby steps, but I've activated a DOAP editor, an RDF/XML loader, and a basic browser store dump at RDFer.com. Would be great to get some DOAP files describing SemWeb projects in there, and maybe some FOAF files as well. That'd make coding the browsers more fun and a bit more real-world-ish.
Thanks for your help!

Merry X-Mas

FOAF in the snow
See you after the snow ;)

Term Shopping for Trackbacks and Projects

I already mentioned the nice HTTP vocab I'm using to describe page views in RDF. I had to add some custom properties to cover things like visits and access hosts, but the main part of the statistics module is built on top of the W3C vocab. The more I work with RDF, the less comfortable I feel with homegrown terms (although they can be handy for prototyping), and thus I spend quite some time on the vocabulary market. Here are two other use cases I was gladly able to model with existing vocabs.

Trackbacks

I wanted to add support for incoming trackbacks to SemSol's blog module. Trackbacks consist of 4 parameters:
  • title (title of the remote post)
  • excerpt (excerpt of the remote post)
  • url (permalink of the remote post)
  • blog_name (name of the remote blog)
Additional local information:
  • date/time of the trackback (i.e. now)
  • permalink of the local post (derived from the trackback URL)
After a fruitful IRC chat with John "SIOC" Breslin, I'm now using (something similar to) the following code:
<$url> a rss:item ;
       an:annotates <$permalink> ;
       dc:title "$title" ;
       dc:description "$excerpt" ;
       dc:date "$now" ;
       dc:source [ dc:title "$blog_name"] .
I could have used rss:description instead of Dublin Core's but thought the structure could more easily be extended to local comments this way. Anyway, as you can see, trackbacks can nicely be described with DC, Annotea, and RSS 1.0.
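
To show how the four trackback parameters and the two local values end up in that structure, here is a hedged Python sketch that renders the same Turtle pattern. It is not SemSol's code: the function names are made up, prefix declarations are left out as in the snippet above, and only the literals are escaped (the URLs are assumed to be valid IRIs already).

```python
from datetime import datetime, timezone


def turtle_string(value):
    """Escape a Python string as a double-quoted Turtle literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"%s"' % escaped


def trackback_to_turtle(url, permalink, title, excerpt, blog_name, now=None):
    """Render one trackback using the structure shown above (prefixes omitted)."""
    now = now or datetime.now(timezone.utc).isoformat()
    return "\n".join([
        "<%s> a rss:item ;" % url,
        "    an:annotates <%s> ;" % permalink,
        "    dc:title %s ;" % turtle_string(title),
        "    dc:description %s ;" % turtle_string(excerpt),
        "    dc:date %s ;" % turtle_string(now),
        "    dc:source [ dc:title %s ] ." % turtle_string(blog_name),
    ])


# Placeholder values for illustration.
print(trackback_to_turtle(
    "http://remote.example/post",
    "http://local.example/2006/my-post",
    'A "quoted" title',
    "First lines of the remote post",
    "Remote Blog",
))
```

Escaping matters here because titles and excerpts arrive from remote blogs and may contain quotes or line breaks that would otherwise produce invalid Turtle.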

Projects, Tools, Applications

The second use case comes from RDFer.com where I'd like to make some of the project and tools data collected during 2005 available. Additionally, I want to provide easy editing forms to let members describe and annotate RDF software. For SemanticWeb.org, we invented an swo:Application class to separate (developer) tools from (end-user) apps. But while analyzing the dataset, I saw that there are additional resource types which fit under the generic "project" concept, e.g. lists or data dumps. I was already in the middle of making up a whole bunch of classes when I remembered an earlier DCMI discussion about the negligible difference between dc:type and rdf:type which referred to DCMI Type definitions. Long story short, DC Types (dctype) combined with FOAF (foaf), DOAP (doap), and the DAML Tool vocab (tool) can be used to describe a whole range of resources:
  • general projects (foaf:Project)
  • software projects (doap:Project, which covers non-OS software as well)
  • resource collections (dctype:Collection, dctype:Dataset)
  • software products (dctype:Software or dctype:InteractiveResource, these could be used to e.g. attach tool:price properties which would perhaps look a bit odd on projects)
  • tools (tool:Tool)
  • online services (dctype:Service)
Something like dct:isPartOf could perhaps even be used to model sub-projects, but I'm not 100% sure.

Bottom line, again: no need for new terms, it's (often) all there already.

Proposals: a new RDF collection and an aluminium edition for FOAF

Nothing special to report from my side, just thought I should post something at least once a month. I'm still working on end-user-friendly RDF annotators, a SKOS editor, and started generalizing my RDF store API in order to eventually turn ARC into a complete RDF toolkit. CONFOTO is going to be upgraded as well.

FOAF- alu edition

But, of course, no plan without attractive hooks for distraction: The new CONFOTO server came with a merchandise shop, so this weekend I re-activated the 3D tool I used for the SemanticWeb.org banner and tried to design a Geek-Shirt for the upcoming SemWeb events I'm going to attend (Semantic Web Days in Munich, and ISWC in Galway). I think my shop is only available in German, so maybe I should have a look at cafepress as well. And there's still this foaflets scene, anyone interested in making a shirt out of it? (Hm, does the FOAF project have a foaf:tipjar we could use for stuff like that?) However, a free T-Shirt for the first to add David Hasselhoff or another ex-star to the FOAF aluminium (FOAF Lite, ya know) edition.

FOAFlets of the Caribbean

While waiting for the US election results coming in last night, I couldn't really concentrate on programming. So I re-arranged my todo list for the semanticweb.org project a little bit and started playing around with Bryce5, a 3D renderer for non-3D-people. I'm thinking about using it for the generation of head graphics for some of the portal's editing tools, or for the site logo. Building the test-FOAFlets was really easy. A fun tool.

foaflets of the caribbean