finally a bnode with a uri

Posts tagged with: calais

Connecting the LOD dots with Calais 4.0 and Zemanta

A fun experiment using open data, RSS, Open Calais and Zemanta.
A couple of weeks ago I wrote about the exciting possibilities of LOD-enabled NLP APIs, and if they could bring another power-up to RDF developers by simplifying the creation of DIY Semantic Web apps. When Thomson Reuters released Calais 4.0 two days ago, I had a go.

The idea: Create a simple tool that aggregates bookmarks and microposts for a given set of tags (from Twitter,, Delicious, and ma.gnolia), pumps them through Calais and Zemanta, and then lets me browse the incoming stream of items based on typed entities, not just keywords. Something like a poor man's Twine, but with a fully-fledged SPARQL API and content automagically enhanced from LOD sources. Check out this month's Semantic Web Gang podcast for more details about Calais.

I set myself a time limit of one person day, so I ended up with just a very basic prototype, but it already shows the network effect kicking in when distributed data fragments can be connected through shared identifiers. Each of the discovered facets can be used as a smart filter (e.g. "Show me only items related to the Person Tim Berners-Lee"), and we could also pull in more information about the entities, as we know their respective LOD URI.

Wish I had funds to explore this a little more, but below is a screenshot showing the "HD Streams" test app in action. It's basically sending each micropost and bookmark to the APIs and then does lookups to DBPedia, Semantic CrunchBase and Freebase to retrieve additional type information. Plus a set of SPARQL+ INSERT queries to later accelerate the filtering.

There are some false positives (e.g. the Calais NLP service is typed as a place), but the APIs offer a score for each detection and I've set the barrier for inclusion very low. The interesting thing is that the grouping of items in the facets column is actually done via LOD information. The APIs only return IDs (or URIs), say, for Berlin, but this reference allows HD Streams to pull in more information and then associate Berlin with the "Place" filter.

This, however, is only the most simple use. The really exciting next step would be smart facets based on the aggregated information. Thanks to SPARQL, I could easily add filters that dive deeper into the LOD-enhanced graph. Like "Filter by posts related to Capitals in Europe", or related to places within a certain lat/long boundary, or with a population larger than x, or about products by competitors of y.

Something the prototype is not doing is expanding shortened URLs. Those could be normalized. Calais 4.0 does URL extraction already, this would just be another SPARQL query and a little PHP loop. Then we could add a simple ranking algorithm based on the number of tweets about a certain URL. The current app took just about 12 hours of work, RDF's extensible data model accelerated development through all stages of the process (well, ok, not during the design/theming phase ;). I didn't have to analyze the data coming from the two APIs at all. No pre-coding schema consideraions. I just loaded everything into my schema-free RDF store and then used incrementally improved graph queries to identify the paths I needed. For geeks: Below is the SPARQL+ snippet that injects LOD entity and label shortcuts from Zemanta results directly into the item descriptions ($res is the URL of an RSS item or bookmark. hds is the namespace prefix used by HD Streams):
INSERT INTO <' . $res . '> {
  <' . $res . '> hds:relatedEntity ?lod_entity .
  ?lod_entity hds:label ?label .
  <' . $res . '> hds:zemantaDoc ?z_doc .
  ?z_result z:doc ?z_doc ; z:confidence ?conf ; z:object ?z_entity .
  ?z_entity owl:sameAs ?lod_entity .
  ?lod_entity z:title ?label .
  FILTER(?conf > 0.2)
  FILTER(REGEX(str(?lod_entity), "(freebase|dbpedia|cb.semsol)"))

I've said it before, but it's worth repeating: RDF and SPARQL are great solutions for today's (and tomorrow's) data integration problem, but they are equally impressive as productivity boosters for software developers.

HD Streams
click for full-size version


No Posts found