finally a bnode with a uri

Posts tagged with: dbpedia

Connecting the LOD dots with Calais 4.0 and Zemanta

A fun experiment using open data, RSS, Open Calais and Zemanta.
A couple of weeks ago I wrote about the exciting possibilities of LOD-enabled NLP APIs, and whether they could bring another power-up to RDF developers by simplifying the creation of DIY Semantic Web apps. When Thomson Reuters released Calais 4.0 two days ago, I had a go.

The idea: Create a simple tool that aggregates bookmarks and microposts for a given set of tags (from Twitter, Delicious, and ma.gnolia), pumps them through Calais and Zemanta, and then lets me browse the incoming stream of items based on typed entities, not just keywords. Something like a poor man's Twine, but with a fully-fledged SPARQL API and content automagically enhanced from LOD sources. Check out this month's Semantic Web Gang podcast for more details about Calais.

I set myself a time limit of one person day, so I ended up with just a very basic prototype, but it already shows the network effect kicking in when distributed data fragments can be connected through shared identifiers. Each of the discovered facets can be used as a smart filter (e.g. "Show me only items related to the Person Tim Berners-Lee"), and we could also pull in more information about the entities, as we know their respective LOD URI.

Wish I had funds to explore this a little more, but below is a screenshot showing the "HD Streams" test app in action. It basically sends each micropost and bookmark to the APIs and then does lookups against DBPedia, Semantic CrunchBase and Freebase to retrieve additional type information. A set of SPARQL+ INSERT queries then stores shortcuts that accelerate the filtering later on.

There are some false positives (e.g. the Calais NLP service is typed as a place), but the APIs offer a score for each detection and I've set the barrier for inclusion very low. The interesting thing is that the grouping of items in the facets column is actually done via LOD information. The APIs only return IDs (or URIs), say, for Berlin, but this reference allows HD Streams to pull in more information and then associate Berlin with the "Place" filter.
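The score threshold and the LOD-based grouping described above can be sketched roughly as follows. This is only an illustration in Python, not the HD Streams code: the detection records, the `lod_types` table (standing in for real DBPedia/Freebase lookups), the `build_facets` helper and the threshold value are all assumptions.

```python
from collections import defaultdict

# Hypothetical detections, shaped like what an NLP API might return:
# an entity URI plus a confidence score.
detections = [
    {"uri": "http://dbpedia.org/resource/Berlin", "score": 0.91},
    {"uri": "http://dbpedia.org/resource/Tim_Berners-Lee", "score": 0.75},
    {"uri": "http://example.org/entity/noise", "score": 0.05},
]

# Hypothetical type table, standing in for the result of LOD lookups.
lod_types = {
    "http://dbpedia.org/resource/Berlin": "Place",
    "http://dbpedia.org/resource/Tim_Berners-Lee": "Person",
}


def build_facets(detections, lod_types, min_score=0.2):
    """Group entity URIs by LOD type, dropping detections below min_score."""
    facets = defaultdict(list)
    for detection in detections:
        if detection["score"] < min_score:
            continue  # below the inclusion barrier
        facet = lod_types.get(detection["uri"], "Unknown")
        facets[facet].append(detection["uri"])
    return dict(facets)


facets = build_facets(detections, lod_types)
```

Raising `min_score` trades recall for fewer false positives; the facet name never comes from the API response itself, only from the type information looked up via the entity URI.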

This, however, is only the simplest use. The really exciting next step would be smart facets based on the aggregated information. Thanks to SPARQL, I could easily add filters that dive deeper into the LOD-enhanced graph. Like "Filter by posts related to Capitals in Europe", or related to places within a certain lat/long boundary, or with a population larger than x, or about products by competitors of y.

Something the prototype is not doing is expanding shortened URLs. Those could be normalized. Calais 4.0 does URL extraction already, so this would just be another SPARQL query and a little PHP loop. Then we could add a simple ranking algorithm based on the number of tweets about a certain URL.

The current app took just about 12 hours of work. RDF's extensible data model accelerated development through all stages of the process (well, ok, not during the design/theming phase ;). I didn't have to analyze the data coming from the two APIs at all. No pre-coding schema considerations. I just loaded everything into my schema-free RDF store and then used incrementally improved graph queries to identify the paths I needed.

For geeks: Below is the SPARQL+ snippet that injects LOD entity and label shortcuts from Zemanta results directly into the item descriptions ($res is the URL of an RSS item or bookmark, hds is the namespace prefix used by HD Streams):
INSERT INTO <' . $res . '> {
  <' . $res . '> hds:relatedEntity ?lod_entity .
  ?lod_entity hds:label ?label .
}
WHERE {
  <' . $res . '> hds:zemantaDoc ?z_doc .
  ?z_result z:doc ?z_doc ; z:confidence ?conf ; z:object ?z_entity .
  ?z_entity owl:sameAs ?lod_entity .
  ?lod_entity z:title ?label .
  FILTER(?conf > 0.2)
  FILTER(REGEX(str(?lod_entity), "(freebase|dbpedia|cb.semsol)"))
}
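The URL ranking idea mentioned before the snippet (normalize shortened URLs, then rank by number of mentions) might look roughly like this. It is a Python sketch rather than the "little PHP loop" itself; the `expanded` table is a hypothetical stand-in for real short-URL expansion (which would follow HTTP redirects), and `rank_urls` is an illustrative name.

```python
from collections import Counter

# Hypothetical lookup table standing in for real short-URL expansion.
expanded = {"http://short.example/abc": "http://example.org/article"}


def rank_urls(mentions, expanded):
    """Normalize each mentioned URL, then rank URLs by mention count."""
    normalized = [expanded.get(url, url) for url in mentions]
    return Counter(normalized).most_common()


# URLs as they might be extracted from a stream of microposts.
mentions = [
    "http://short.example/abc",
    "http://example.org/article",
    "http://example.org/other",
]
ranking = rank_urls(mentions, expanded)
```

Without the normalization step, the short and the long form of the same link would be counted as two different URLs.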

I've said it before, but it's worth repeating: RDF and SPARQL are great solutions for today's (and tomorrow's) data integration problem, but they are equally impressive as productivity boosters for software developers.

[Screenshot: the HD Streams test app]

Semantic Web by Example: Semantic CrunchBase

CrunchBase is now available as Linked Data including a SPARQL endpoint and a custom API builder based on SPARQLScript.
Update: Wow, these guys are quick, there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served directly by the CrunchBase site itself (maybe using the nicely evolving Rena toolkit), but the SemWeb community has a reputation of scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits. So, at this stage (and given the nice API), it might make more sense to start with a separate site, and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).
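To illustrate what "largely automated" conversion with a single namespace can mean, here is a minimal Python sketch. It is not the actual Semantic CrunchBase importer: the `CB` namespace URI, the `json_to_triples` helper, the blank-node naming and the sample document are all assumptions made for the example.

```python
import json

CB = "http://example.org/cb/"  # hypothetical namespace, not the real one


def json_to_triples(subject, doc, ns=CB):
    """Turn a JSON object into (subject, predicate, object) triples.

    Attribute names are reused as property names in one namespace;
    nested objects become blank nodes; null values are skipped.
    """
    triples = []
    counter = [0]  # mutable counter for blank node labels

    def walk(subj, obj):
        for key, value in obj.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                if item is None:
                    continue
                if isinstance(item, dict):
                    counter[0] += 1
                    node = f"_:b{counter[0]}"
                    triples.append((subj, ns + key, node))
                    walk(node, item)
                else:
                    triples.append((subj, ns + key, item))

    walk(subject, doc)
    return triples


doc = json.loads(
    '{"name": "Twitter", "offices": [{"state_code": "CA"}], "homepage_url": null}'
)
triples = json_to_triples("http://example.org/company/twitter#self", doc)
```

Because the property names mirror the JSON keys, someone who knows the JSON documents can guess the predicates to use in a query without consulting a vocabulary first.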

An important principle in RDF land is the distinction between a resource and a page about a resource (it's very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for e.g. Twitter and the Twitter description. There are different approaches; I decided to use (fake-)hash URIs, which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company.
  • GETting the identifier resolves to /company/twitter, which describes the company.
  • Direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via Content Negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
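The identifier scheme from the list above boils down to two small string operations, sketched here in Python. The host name `example.org` and the helper names are assumptions; only the `#self` fragment and the `.rdf` suffix come from the text.

```python
from urllib.parse import urldefrag


def document_uri(resource_uri):
    """Drop the fragment: a client never sends '#self' to the server."""
    return urldefrag(resource_uri).url


def rdf_uri(resource_uri):
    """URI of the RDF representation, per the '.rdf' suffix convention."""
    return document_uri(resource_uri) + ".rdf"


# '#self' names the company itself; the fragment-less URI names the page about it.
company = "http://example.org/company/twitter#self"
page = document_uri(company)
data = rdf_uri(company)
```

This is why hash URIs need no redirects: stripping the fragment happens on the client side as part of ordinary HTTP behavior.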

The instant benefit of having linked data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors to their organizations to their relations etc.). However, the CrunchBase team has already done a great job: their UI supports this functionality quite nicely, so the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not obvious: An RDF-powered app can be extended at any time. On the data-level. Without the need for model changes (because there is none specified). And without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard Jobs, DBPedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders", or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Right now, those can only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends. Setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information. Here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
(Query result as HTML)

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far, it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies, investments, and has structured location data. CrunchBoard on the other hand has job descriptions, but only a single field for City and State, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS limitations/practices; the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <>
# refresh feed
if (${GET.refresh}) {
  # replaced <> with full feed
  LOAD <>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)
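For readers less familiar with graph patterns, the join performed by the script can be spelled out in plain Python. This sketch only mirrors the logic: the sample records, the field names `state_codes` and `round_codes`, and the `matching_jobs` helper are assumptions, not CrunchBase or CrunchBoard data structures.

```python
# Hypothetical job entries, as they might come out of the CrunchBoard feed.
jobs = [
    {"job_title": "Backend Engineer", "job_link": "http://example.org/jobs/1", "comp_name": "Acme"},
    {"job_title": "Designer", "job_link": "http://example.org/jobs/2", "comp_name": "Globex"},
]

# Hypothetical company records, as they might come out of the CrunchBase graph.
companies = [
    {"comp_name": "Acme", "comp_link": "http://example.org/company/acme",
     "state_codes": {"CA"}, "round_codes": {"a", "b"}},
    {"comp_name": "Globex", "comp_link": "http://example.org/company/globex",
     "state_codes": {"NY"}, "round_codes": {"a"}},
]


def matching_jobs(jobs, companies, state="CA", round_code="a"):
    """Join jobs to companies on the company name, then filter companies."""
    by_name = {
        c["comp_name"]: c
        for c in companies
        if state in c["state_codes"] and round_code in c["round_codes"]
    }
    return [
        {**job, "comp_link": by_name[job["comp_name"]]["comp_link"]}
        for job in jobs
        if job["comp_name"] in by_name
    ]


result = matching_jobs(jobs, companies)
```

The shared `?comp_name` variable in the script plays the role of the `by_name` dictionary here: it is the only thing the two sources have in common, so an exact name match is what links them.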

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls
  • add a template mechanism for human-friendly results

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post. Here are some example API calls that were already created with it:
A lot of fun, more to come.

