
Posts tagged with: semantic crunchbase

Writing Inference Rules with SPARQLScript

SPARQLScript can be used for forward chaining, including string manipulations on the fly.
In order to keep the data structures in Semantic CrunchBase close to the source API, I used a 1-to-1 mapping between CrunchBase JSON keys and RDF terms (with only a few exceptions). This was helpful for people who already know the JSON API, but it made it hard to interlink the converted information with existing SemWeb data such as FOAF or the various LOD sources.
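To illustrate the idea (the resource URI and literal values below are made up, but the cb: properties appear in the inference scripts further down), a JSON key such as first_name simply becomes a cb:first_name property:

# hypothetical example: CrunchBase JSON keys map 1-to-1 to cb: properties
<http://example.org/person/jane-doe>
    cb:first_name "Jane" ;
    cb:last_name "Doe" ;
    cb:web_presence <http://example.com/> .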

SPARQLScript is already heavily used by the Pimp-My-API tool and the TwitterBot, but yesterday I added a couple of new features and finally had a go at implementing a (forward chaining) rule evaluator (for the reasons mentioned some time ago).

A first version ("LOD Linker") is installed on Semantic CB, with an initial set of 9 rules (feel free to leave a comment here if you need additional mappings). With SPARQLScript being a superset of SPARQL+, most inference scripts are not much more than a single INSERT + CONSTRUCT query (you can click the form's "Show inference scripts" button to see the source code):
$ins_count = INSERT INTO <${target_g}>
  CONSTRUCT {?res a foaf:Organization } WHERE {
    { ?res a cb:Company }
    UNION { ?res a cb:FinancialOrganization }
    UNION { ?res a cb:ServiceProvider }
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res a foaf:Organization } }
    FILTER(!bound(?g))
  }
  LIMIT 2000
But with the latest SPARQLScript processor (ARC release 2008-09-12) you can run more sophisticated scripts, such as the one below, which infers DBPedia links from Wikipedia URLs:
$rows = SELECT ?res ?link WHERE {
    { ?res cb:web_presence ?link . }
    UNION { ?res cb:external_link ?link . }
    FILTER(REGEX(?link, "wikipedia.org/wiki"))
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res owl:sameAs ?v2 } . }
    FILTER(!bound(?g))
  }
  LIMIT 500

$triples = "";
FOR ($row in $rows) {
  # extract the wikipedia identifier
  $id = ${row.link.replace("/^.*\/([^\/\#]+)(\#.*)?$/", "\1")};
  # construct a dbpedia URI
  $res2 = "http://dbpedia.org/resource/${id}";
  # append to triples buffer
  $triples = "${triples} <${row.res}> owl:sameAs <${res2}> . "
}

#insert
if ($triples) {
  $ins_count = INSERT INTO <${target_g}> { ${triples} }
}

(I'm using a similar script to generate foaf:name triples by concatenating cb:first_name and cb:last_name.)
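The rough shape of that rule looks like the following sketch (the variable names, the LIMIT, and the assumption that a double quote can be backslash-escaped inside a SPARQLScript string literal are mine; the overall structure simply mirrors the DBPedia script above):

$rows = SELECT ?res ?first ?last WHERE {
    ?res cb:first_name ?first .
    ?res cb:last_name ?last .
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res foaf:name ?v2 } }
    FILTER(!bound(?g))
  }
  LIMIT 500

$triples = "";
FOR ($row in $rows) {
  # concatenate first and last name into a foaf:name literal
  $triples = "${triples} <${row.res}> foaf:name \"${row.first} ${row.last}\" . "
}

if ($triples) {
  $ins_count = INSERT INTO <${target_g}> { ${triples} }
}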

Inferred triples are added to a graph directly associated with the script. Apart from a destructive rule that removes all email addresses, the reasoning can easily be undone by running a single DELETE query against the inferred graph.
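A minimal sketch of that undo step, assuming SPARQL+'s DELETE FROM form clears the whole graph (target_g is the same graph placeholder as in the scripts above):

# drop everything a rule has inferred
DELETE FROM <${target_g}>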

I'm quite happy with the functionality so far. What's still missing is a way to rewrite bnodes; I don't think that's possible yet. But INSERT + CONSTRUCT leaves bnode IDs unchanged, so the inference scripts don't necessarily require URI-denoted resources.

Another cool aspect of SPARQLScript-based inferencing is that a rule can be evaluated against a federated set of endpoints, with each endpoint processing only a part of the rule. The initial DBPedia mapper above, for example, uses locally available Wikipedia links. However, CrunchBase provides only very few of those, so I created a second script which retrieves DBPedia identifiers for local company homepages, using a combination of local queries and remote ones against the DBPedia SPARQL endpoint (in small iterations and only for companies with at least one employee, but it works).
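The basic shape of that second script is roughly this (the property cb:homepage_url, the LIMITs, the foaf:homepage lookup on the DBPedia side, and the nested loops are assumptions, not the actual code; ENDPOINT switches the target of the queries that follow it):

# local query: companies and their homepages
$rows = SELECT ?res ?hp WHERE { ?res cb:homepage_url ?hp } LIMIT 50

# remote lookups against the DBPedia SPARQL endpoint
ENDPOINT <http://dbpedia.org/sparql>
$triples = "";
FOR ($row in $rows) {
  $dbp_rows = SELECT ?dbp WHERE { ?dbp foaf:homepage <${row.hp}> } LIMIT 1
  FOR ($dbp_row in $dbp_rows) {
    $triples = "${triples} <${row.res}> owl:sameAs <${dbp_row.dbp}> . "
  }
}

# the INSERT into the local inference graph then follows as in the scripts
# above, after pointing the processor back at the local store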

A Faceted Browser for ARC

One of the first Trice components is probably going to be a faceted browser for ARC.
I'm going on vacation in a couple of days, and before that, I'm trying to tick off at least a few of the bigger items on my ToDo list. I was hoping for a first Trice preview (now that ARC is slowly getting stable), but this will have to wait until September. However, I managed to get another component that's been on my list for ages into a demo-able state today: A SPARQL/ARC-based faceted browser (test installation at Semantic CrunchBase).

[Screenshot: faceted browser]

It's an early but working (I think ;-) version. A template mechanism for the item previews is still missing, but I'm already quite happy with the facet column. The facets are auto-generated (based on statistical info and scope detection), but it's also possible to define custom filters (for more complicated graph patterns, see the screenshot below). Once again, SPARQLScript simplified development, thanks to its placeholders for parameterized queries.
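For illustration (the placeholder names below are made up, not taken from the browser code), a facet-restricted item query can simply inject the currently active filter pattern and paging offset via placeholders:

# parameterized item query, filled in by the browser at runtime
$items = SELECT DISTINCT ?item WHERE {
    ?item a ${item_type} .
    ${active_filter_patterns}
  }
  LIMIT 20 OFFSET ${offset}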

[Screenshot: faceted browser administration]

I think I'm going to use the browser for a first Trice bundle. It's not too sophisticated, but builds on several core features such as request dispatching, RDF/SPARQL-based views and forms, basic AJAX calls, and cached template sections.
