finally a bnode with a uri

Posts tagged with: sparqlscript

ESWC 2009 Linked Data Dashboards

A first Paggr application went live during ESWC2009.
In case you missed the tweets or a local announcement: The first Paggr application went online a few days ago. This year's ESWC Technologies Team pushed things a little further, with RFID tracking during the event and extended conference data that includes detailed session and date/time information (kudos to Michael Hausenblas for RDFizing even PDFs).

Based on this dataset, we provided a conference explorer and stress-tested the "Dog Food" server while at it. The system survived, but I also learned a lot. We used about 50 RDF stores for the different public and user-specific dashboards, which basically worked nicely. However, rendering non-ugly resource summaries requires a bit of endpoint hammering, and some of the more complex path queries resulted in timeouts. Yesterday, I had to create a mirror from the data dump to route a couple of widgets through a replicated (ARC :-) endpoint. But then this is also one of the powerful possibilities that come with semantic web technologies. You can often switch or double the back-end repository in no time, and without any code changes. (And as all the Sparqlets are created in a web-based tool, I didn't even have to upload a changed configuration file. I simply tweaked a SPARQLScript parameter.)

Anyway, there are a couple of public dashboards in case you'd like to give it a try (it's still an early version). I also embedded a short screencast below. The system is going to be moved to a DERI server when the conference is over, but the URIs and data will probably stay stable. (And no, it won't really work with IE yet.) More to come!

HQ version (quicktime, 110MB)

SPARQLBot - Your Semantic Web Commandline

SPARQLBot is now officially launched
Update: I added a Ubiquity script after a suggestion by Gautier.

SPARQLBot, the weekend project we started at SemanticCamp London, is now finally online at a proper home, and with a more solid toolset. I've ported the essential commands from the old site, and the "Getting Started" manual should be online later today as well.

What is SPARQLBot?

SPARQLBot is a web-based service that reads and writes Semantic Web data based on simple, human-friendly commands received via IRC or the Web. The command base can be freely extended using a browser-based editor. SPARQLBot can process microformats, RSS, several RDF serializations, and results from parameterized SPARQL queries.

New Features

SPARQLBot was more or less rewritten from scratch. Compared to the earlier version, things have become much more powerful, but also simpler and more stable in many cases. The system can now:
  • operate on multiple freenode IRC channels (just send "join #channel" to "sparqlbot"),
  • reply to private IRC messages,
  • be accessed via the Ubiquity plugin,
  • reuse other commands,
  • call web APIs via GET or POST,
  • access arbitrary SPARQL endpoints,
  • help you cut your way through the growing Linked Data cloud,
  • use a single command to combine results from federated SPARQL endpoints and datasets such as DBPedia, DBLP, the SemWeb Conference Corpus, GeoNames, CrunchBase, or flickr wrappr,
  • produce highly customizable output via SPARQL result templates,
  • OpenID-protect your commands,
  • cache results in a local SPARQL+-enabled store.
(please see the manual for details)

If you happen to be at ISWC next month and would like to have a look behind the scenes, I'll present SPARQL+ and SPARQLScript with SPARQL result templates during the poster session.

Writing Inference Rules with SPARQLScript

SPARQLScript can be used for forward chaining, including string manipulations on the run.
In order to keep data structures in Semantic CrunchBase close to the source API, I used a 1-to-1 mapping between CrunchBase JSON keys and RDF terms (with only a few exceptions). This was helpful for people knowing the JSON API, but it wasn't easy to interlink the converted information with existing SemWeb data such as FOAF, or the various LOD sources.

SPARQLScript is already heavily used by the Pimp-My-API tool or the TwitterBot, but yesterday I added a couple of new features and finally had a go at implementing a (forward chaining) rule evaluator (for the reasons mentioned some time ago).

A first version ("LOD Linker") is installed on Semantic CB, with initially 9 rules (feel free to leave a comment here if you need some additional mappings). With SPARQLScript being a superset of SPARQL+, most inference scripts are not much more than a single INSERT + CONSTRUCT query (you can click on the form's "Show inference scripts" button to see the source code):
$ins_count = INSERT INTO <${target_g}>
  CONSTRUCT {?res a foaf:Organization } WHERE {
    { ?res a cb:Company }
    UNION { ?res a cb:FinancialOrganization }
    UNION { ?res a cb:ServiceProvider }
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res a foaf:Organization } }
    FILTER(!bound(?g))
  }
  LIMIT 2000
But with the latest SPARQLScript processor (ARC release 2008-09-12) you can run more sophisticated scripts, such as the one below, which infers DBPedia links from wikipedia URLs:
$rows = SELECT ?res ?link WHERE {
    { ?res cb:web_presence ?link . }
    UNION { ?res cb:external_link ?link . }
    FILTER(REGEX(?link, ""))
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res owl:sameAs ?v2 } . }
    FILTER(!bound(?v2))
  }
  LIMIT 500

$triples = "";
FOR ($row in $rows) {
  # extract the wikipedia identifier
  $id = ${"/^.*\/([^\/\#]+)(\#.*)?$/", "\1")};
  # construct a dbpedia URI
  $res2 = "${id}";
  # append to triples buffer
  $triples = "${triples} <${row.res}> owl:sameAs <${res2}> . "
}

if ($triples) {
  $ins_count = INSERT INTO <${target_g}> { ${triples} }
}

(I'm using a similar script to generate foaf:name triples by concatenating cb:first_name and cb:last_name.)
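
For readers who want to try the identifier extraction from the loop above outside of SPARQLScript, here is a minimal Python sketch of the same regular expression. The DBPedia prefix and the function names are my assumptions for illustration; the script above does not spell them out.

```python
import re

# Assumption: DBpedia resource URIs reuse the local name of the Wikipedia
# article. The prefix below is not taken from the script above.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"

def wikipedia_id(link: str) -> str:
    """Return the last path segment of a URL, dropping any #fragment."""
    return re.sub(r"^.*/([^/#]+)(#.*)?$", r"\1", link)

def same_as_triple(res: str, link: str) -> str:
    """Build one owl:sameAs statement for the triples buffer."""
    return f"<{res}> owl:sameAs <{DBPEDIA_PREFIX}{wikipedia_id(link)}> . "
```

Calling `wikipedia_id` on a Wikipedia article URL with or without a fragment yields the bare article name, which is then appended to the prefix.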

Inferred triples are added to a graph directly associated with the script. Apart from a destructive rule that removes all email addresses, the reasoning can easily be undone again by running a single DELETE query against the inferred graph.
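
The idea of writing inferred triples into their own graph, so that a rule can be undone by dropping that graph, can be sketched in a few lines of Python. This is only an in-memory illustration under my own assumptions (a dict of named graphs holding triple tuples, made-up function names), not how ARC stores anything.

```python
# Triples are (subject, predicate, object) tuples; the store maps
# graph names to sets of triples. Both are illustrative assumptions.
SOURCE_TYPES = {"cb:Company", "cb:FinancialOrganization", "cb:ServiceProvider"}

def infer_organizations(store: dict, target_graph: str) -> int:
    """Add 'a foaf:Organization' for every typed resource that lacks it."""
    all_triples = set().union(*store.values())
    known = {s for (s, p, o) in all_triples
             if p == "rdf:type" and o == "foaf:Organization"}
    new = {(s, "rdf:type", "foaf:Organization")
           for (s, p, o) in all_triples
           if p == "rdf:type" and o in SOURCE_TYPES and s not in known}
    store.setdefault(target_graph, set()).update(new)
    return len(new)

def undo(store: dict, target_graph: str) -> None:
    """Drop the inferred graph, leaving the source data untouched."""
    store.pop(target_graph, None)
```

Running the rule a second time adds nothing, which mirrors the "prevent dupes" part of the first script.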

I'm quite happy with the functionality so far. What's still missing is a way to rewrite bnodes; I don't think that's already possible. But INSERT + CONSTRUCT will leave bnode IDs unchanged, so the inference scripts don't necessarily require URI-denoted resources.

Another cool aspect of SPARQLScript-based inferencing is the possibility to use a federated set of endpoints, each processing only a part of a rule. The initial DBPedia mapper above, for example, uses locally available wikipedia links. However, CrunchBase only provides very few of those. So I created a second script which can retrieve DBPedia identifiers for local company homepages, using a combination of local queries and remote ones against the DBPedia SPARQL endpoint (in small iterations and only for companies with at least one employee, but it works).

A Faceted Browser for ARC

One of the first Trice components is probably going to be a faceted browser for ARC
I'm going on vacation in a couple of days, and before that, I'm trying to tick off at least a few of the bigger items on my ToDo list. I was hoping for a first Trice preview (now that ARC is slowly getting stable), but this will have to wait until September. However, I managed to get another component that's been on my list for ages into a demo-able state today: A SPARQL/ARC-based faceted browser (test installation at Semantic CrunchBase).

[Screenshot: faceted browser]

It's an early, but working (I think ;) version. A template mechanism for the item previews is still missing, but I'm already quite happy with the facet column. The facets are auto-generated (based on statistical info and scope-detection), but it's also possible to define custom filters (for more complicated graph patterns, see screenshot below). Once again, SPARQLScript simplified development, thanks to its placeholders for parameterized queries.

[Screenshot: faceted browser administration]

I think I'm going to use the browser for a first Trice bundle. It's not too sophisticated, but builds on several core features such as request dispatching, RDF/SPARQL-based views and forms, basic AJAX calls, and cached template sections.

Pimp My (CrunchBase) API

Define your own CrunchBase API commands with SPARQL
In the Semantic CrunchBase announcement, we saw how SPARQL can be used to retrieve fine-grained information from the CrunchBase graph. This follow-up post explains "Pimp My API", a browser-based tool for creating tailored API calls by combining SPARQL with input parameters and output templating. The command editor consists of three tabs: "Define it", "Test it", and "Activate it".

Step 1: Define a new API command

In the 1st field ("Command") you define a (human-readable) command, with input parameters set via the ${parameter_name} notation. In the screenshot on the left, we created "${role} of ${comp_name}" which we are going to use to retrieve persons with a specific role at a given company. The command processor will automatically assign variables for a matching input string, e.g. "Editor of TechCrunch" will set the variable ${role} to "Editor", and ${comp_name} to "TechCrunch".
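
To make the variable assignment a bit more tangible, here is a rough Python sketch of how a command pattern could be matched against an input string. It is an illustration only: the function names are made up, and the actual command processor may well work differently.

```python
import re

def compile_command(pattern: str) -> re.Pattern:
    """Turn '${role} of ${comp_name}' into a regex with named groups."""
    # re.split with one capture group alternates literal text and names.
    parts = re.split(r"\$\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$", re.IGNORECASE)

def parse_command(pattern: str, text: str):
    """Return the variable assignments for a matching input, else None."""
    match = compile_command(pattern).match(text.strip())
    return match.groupdict() if match else None
```

For the pattern "${role} of ${comp_name}" and the input "Editor of TechCrunch", `parse_command` returns a dict with "role" set to "Editor" and "comp_name" set to "TechCrunch".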

Now on to the 2nd field ("SPARQLScript code"):
SPARQLScript is an experiment to extend SPARQL with scripting language features such as variable assignments, loops, etc. (think Pipes for SemWeb developers). If you are familiar with SPARQL, you will notice only three differences to a standard SPARQL query: In the first line, we are setting a target SPARQL service for the following script blocks. In the second line, we assign the results from the SELECT query to a variable, and the third difference is the use of placeholders in the query. These placeholders will be filled from matching variables before the query is sent to the target endpoint.
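
The placeholder step can be pictured as a simple substitution pass over the query text. The Python sketch below is an assumption-laden illustration (the function name, the dotted lookup, and leaving unknown placeholders untouched are my choices, not documented SPARQLScript behaviour), and it does no escaping of the inserted values.

```python
import re

def fill_placeholders(template: str, scope: dict) -> str:
    """Replace ${name} and ${name.key} with values from the script scope."""
    def lookup(match: re.Match) -> str:
        value = scope
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                return match.group(0)  # leave unknown placeholders as-is
            value = value[key]
        return str(value)
    return re.sub(r"\$\{([\w.]+)\}", lookup, template)
```

A real processor would have to escape quotes in the values before sending the query to an endpoint; the sketch skips that on purpose to stay short.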

If you don't know SPARQL at all, here is a pseudo-translation of the query: Find resources (?comp) with a cb:name (cb is the CrunchBase namespace used for CB attributes) that equals the input parameter "comp_name", and a relationship (?rel). The relationship should have an attribute ?role which regex-matches the input parameter "role". The relationship should also have a cb:person attribute (linking to ?person). The ?person node should have the cb:first_name and cb:last_name attributes. Those should be returned by the query as "fname" and "lname" variables. The whole result set is then assigned to a variable named "rows" (Hmm, maybe the SPARQL is easier to read than my explanation ;)

The third form field lets us define an output template. Each stand-alone section surrounded by quotation marks will fill the output buffer. Thus, looping through the "rows" will create a small name snippet for each row. Again, placeholders will be filled with values from the current script scope.

Step 2: Test your new Command

Using the Test form, we can see if our command pattern works, and if the result is formatted as desired. Should anything go wrong, we can select "Show raw output" to get some debugging information. Please note, even though we are using a browser, simple HTML forms, and a friendly pattern language, the commands are sent to real Web services. A broken script usually just hurts your local machine. A distributed Semantic Web processor like this, however, may harm other people's servers, so we should be careful, start small, and improve our script incrementally. In this case, the output result is a little ugly, so we could improve the output template and inject commas.

Step 3: HTTP access activation

Our command is now defined and successfully tested, so let's turn it into a public API call.
Instead of the sort-of natural language command, the API expects GET or POST arguments.

The example above generates a plain text result, but it's also possible to return markup or other formats. SPARQLScript can access GET variables via ${GET.var_name}; this feature can be used to create different output, depending on e.g. a "format" parameter. I'm also working on support for content negotiation, where you'd simply create a "${rows}" template and the SPARQLScript processor would auto-generate an appropriate serialization including correct HTTP headers.
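
As a rough illustration of switching the output on a "format" parameter, here is a small Python sketch. The function name, the supported formats, and the "fname"/"lname" row keys (borrowed from the query description above) are assumptions for the example, not the processor's actual behaviour.

```python
import json

def render(rows: list, params: dict) -> tuple:
    """Return (content_type, body) for the requested 'format' parameter."""
    if params.get("format") == "json":
        return "application/json", json.dumps(rows)
    # default: one "first last" name per line, as plain text
    body = "\n".join(f"{r['fname']} {r['lname']}" for r in rows)
    return "text/plain", body
```

Returning the content type together with the body keeps the HTTP header decision next to the serialization choice, which is roughly what content negotiation would automate.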

Step 4: Have some fun

You may wonder why the command editor allows the definition of a human-friendly pattern, when the API itself just needs the parameters. The patterns allow the implementation of an API call detector, i.e. depending on the input stream at a generic service URL, we can auto-detect the right script to run. I've test-implemented a Twitter bot that can reply to messages that match a stored API command on Semantic CrunchBase (Inactive during the week-end, it's not tested enough. Stay tuned ;). Here is a teaser screenshot for next week:
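
A call detector along these lines could be as simple as trying each stored pattern in turn. The following Python sketch shows the idea; the command table, the script names, and the first-match-wins policy are hypothetical.

```python
import re

def detect_command(commands: dict, text: str):
    """Return (script_name, variables) for the first stored pattern that matches."""
    for pattern, script in commands.items():
        # Escape the literal text, then turn each escaped ${name} into a named group.
        escaped = re.escape(pattern)
        regex = re.sub(r"\\\$\\\{(\w+)\\\}", r"(?P<\1>.+?)", escaped)
        match = re.fullmatch(regex, text.strip(), re.IGNORECASE)
        if match:
            return script, match.groupdict()
    return None
```

With several overlapping patterns, the order of the command table decides which script runs, so a real detector would probably need some ranking.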


Semantic Web by Example: Semantic CrunchBase

CrunchBase is now available as Linked Data including a SPARQL endpoint and a custom API builder based on SPARQLScript.
Update: Wow, these guys are quick, there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served directly from the CrunchBase site (maybe using the nicely evolving Rena toolkit), but the SemWeb community has a reputation of scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits, so, at this stage (and given the nice API), it might make more sense to start with a separate site, and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).

An important principle in RDF land is the distinction between a resource and a page about a resource (it's very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for e.g. Twitter and the Twitter description. There are different approaches, I decided to use (fake-)hash URIs which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company,
  • GETing the identifier resolves to /company/twitter which describes the company.
  • Direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via Content Negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
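
The relationship between the three kinds of identifiers can be shown with a few lines of Python. The host name is a placeholder and the function names are made up; only the "#self" fragment and the ".rdf" suffix come from the list above.

```python
from urllib.parse import urldefrag

def document_uri(resource_uri: str) -> str:
    """Strip the fragment: this is the URI a client actually GETs."""
    return urldefrag(resource_uri).url

def rdf_uri(resource_uri: str) -> str:
    """Document URI with the '.rdf' suffix described above."""
    return document_uri(resource_uri) + ".rdf"
```

HTTP clients never send the fragment to the server, which is why a hash URI for the company automatically resolves to the page describing it.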

The instant benefit of having linked data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors to their organizations to their relations etc.). However, the CrunchBase team has already done a great job: their UI supports this functionality quite nicely, so the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not obvious: An RDF-powered app can be extended at any time. On the data-level. Without the need for model changes (because there is none specified). And without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard Jobs, DBPedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders", or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Those can now only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends. Setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information. Here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
(Query result as HTML)

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.
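
If you'd rather send such queries from a script than from the HTML form, the SPARQL protocol just expects the query text in a "query" parameter. Here is a minimal Python sketch that builds the GET URL; the endpoint address is a placeholder, not the real Semantic CrunchBase URL.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address; the real one is not given in the text.
ENDPOINT = "http://example.org/sparql"

def sparql_get_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Build a SPARQL protocol GET URL carrying the query in 'query'."""
    return endpoint + "?" + urlencode({"query": query})
```

The resulting URL can be fetched with any HTTP client; which result formats an endpoint offers depends on the server.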

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far, it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies, investments, and has structured location data. CrunchBoard on the other hand has job descriptions, but only a single field for City and State, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS' limitations/practices, the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <>
# refresh feed
if (${GET.refresh}) {
 # replaced <> with full feed
 LOAD <>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)
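
The "pattern join" in the script above boils down to matching the two sources on the company name. For readers less used to graph patterns, here is the same join written as plain Python over made-up records; the field names and the function are hypothetical and only mirror the variables in the query.

```python
def match_jobs(jobs: list, companies: list, state: str, round_code: str) -> list:
    """Join job postings to companies on the company name, then filter."""
    by_name = {c["name"]: c for c in companies}
    result = []
    for job in jobs:
        comp = by_name.get(job["company"])
        # keep jobs whose company has an office in the state and the funding round
        if comp and state in comp["states"] and round_code in comp["rounds"]:
            result.append({"job_title": job["title"], "comp_name": comp["name"]})
    return result
```

The join is only as good as the shared key: if the feed spells a company name differently from CrunchBase, the job silently drops out, which is also true for the SPARQL version.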

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls
  • add a template mechanism for human-friendly results

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post. Here are some example API calls that were already created with it:
A lot of fun, more to come.

SPARQLScript - Semantic Mashups made easy

SPARQLScript gets loops and output templating and can now be used to build simple semantic mashups.
What is a scripting language without loops, or a Web language without a template mechanism? Not really usable. Yesterday, I finally added the two missing core features to my SPARQLScript processor, and I'm excited about eventually being able to test the whole thing. This is just the beginning (there is no string concatenation yet, and no WHILE blocks), but with the basic infrastructure (and documentation) in place, it's time to start gathering feedback. I'm going to upgrade SPARQLBot in the next couple of days which should be a fun way to explore the possibilities (also, it was the bot's users who triggered the creation of SPARQLScript in the first place).

So, what is it actually good for?

Mid-term-ish, I'm dreaming of an alternative to increasingly non-RDFy specs such as RIF and OWL2 (there is definitely some need for them, they just don't seem to really work for me and my Web stuff). Things like crawling, smushing, or custom inference tasks based on wild mixtures of RDFS, OWL, and SKOS should be doable with SPARQLScript.

Simple agents are another use case, as SPARQLScript simplifies task federation across multiple endpoints and RDF sources.

What's working already today is the creation of simple mashups and widgets. Below is a script that integrates status notices from my twitter and feeds, and then creates an HTML "lifestream" snippet. The (live!) result is embedded at the bottom of this post.
# global prefix declarations
PREFIX dc: <>
PREFIX rss: <>

# the target store

# refresh feeds every 30 minutes
$up2date = ASK FROM <script-infos> WHERE {
  <script-infos> dc:date ?date . FILTER (?date > "${NOW-30min}")
}
IF (!$up2date) {
  # load feeds
  LOAD <>
  LOAD <>
  # remember the update time
  INSERT INTO <script-infos> { <script-infos> dc:date "${NOW}" }
}

# retrieve items
$items = SELECT * WHERE {
  ?item a rss:item ;
        rss:title ?title ;
        dc:date ?date .
}

# output template
"""<h4>My online lifestream:</h4>
<ul>"""
FOR ($item in $items) {
  """<li><a href="${item.item}">${item.title}</a></li>"""
}
"""</ul>"""

(S)mashups here we come :)

SPARQLScript Teaser

Basic support for variable assignments, placeholders, IF branches
I just managed to trick my experimental SPARQLScript parser into accepting simple IF-branches and placeholders. Here is an example of what is going to be possible with ARC soon (and yes, I know this snippet most probably won't excite anyone but me ;)
PREFIX dc: <>

# set the endpoint
ENDPOINT <endpoint.php>

# feed still fresh?
$current = ASK FROM <graph-updates> WHERE {
  <> dc:date ?date .
  FILTER (?date > ${now-1h})
}

# refresh feed and update graph log
IF (!$current) {
  LOAD <>
  INSERT INTO <graph-updates> { <> dc:date "${now}" }
}
(Parsed Structure)
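
The "still fresh?" logic of the script above is easy to mirror in a general-purpose language. The Python sketch below is just an illustration with invented names: a dict plays the role of the graph-updates log, and the loader is passed in as a callable.

```python
from datetime import datetime, timedelta

def is_fresh(last_update, now, max_age=timedelta(hours=1)) -> bool:
    """True if the graph was updated within max_age before 'now'."""
    return last_update is not None and last_update > now - max_age

def refresh_if_stale(log: dict, graph: str, now: datetime, load) -> bool:
    """Call load(graph) and record the time unless the graph is still fresh."""
    if is_fresh(log.get(graph), now):
        return False
    load(graph)
    log[graph] = now
    return True
```

Passing the current time in explicitly (instead of reading the clock inside the function) makes the check easy to test.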

The fun thing about the whole SPARQLScript experiment is that the parser (so far) is still below 200 LOC. A lot can be re-used from the official SPARQL Grammar, e.g. IF-blocks are really just:
Script ::= ( Query | PrefixDecl | EndpointDecl | Assignment | IFBlock )*
IFBlock ::= 'IF' BrackettedExpression '{' Script '}'

Implementing the actual SPARQLScript processing engine is of course more work than the parser, but I'm making progress there, too.

Major ARC revision: Talis platform-alignment, Remote Store, SPARQLScript

The latest ARC revision is aligned with Talis' platform structures, got a Remote Store component, and the start of a SPARQLScript implementation
The latest ARC release comes with a couple of non-trivial (but also not necessarily obvious) changes. The most significant (as it involves ARC's resource indexes) is the alignment with the structures used by the Talis platform. ARC's parser output and PHP or JSON formats are now directly processable by Talis' platform tools. The documentation has been updated already; you may have to adjust your code (basically just "s/val/value/" and "s/dt/datatype/") in a few places.
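
If you have old structures lying around as JSON, the two renames can also be applied to the data itself rather than to the code. Here is a small Python sketch of such a migration; the function name and the assumption that "val" and "dt" only ever appear as object keys are mine, so check it against your own dumps first.

```python
# Key renames taken from the two substitutions mentioned above.
RENAMES = {"val": "value", "dt": "datatype"}

def rename_keys(node):
    """Recursively apply the key renames to nested dicts and lists."""
    if isinstance(node, dict):
        return {RENAMES.get(k, k): rename_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_keys(item) for item in node]
    return node
```

The function renames matching keys at every nesting level, so a subject or predicate that happens to be named "val" or "dt" would be renamed too.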

The second major addition is a Remote Store component (documentation still to come) that is inspired by and based on Morten Frederiksen's great RemoteEndpointPlugin. The Remote Store works like Morten's Plugin, but supports SPARQL+' LOAD, INSERT, and DELETE (i.e. write/POST) operations.

The third addition is also the reason why the Remote Store (which can be used as a SPARQL Endpoint Proxy) became a core component. I've worked on a draft for a SPARQL-based scripting language during the last months, and the latest ARC revision includes an early SPARQLScript parser and a SPARQLScript processor that can run a set of routines against remote SPARQL endpoints. What's still missing before this stuff becomes more usable (apart from documentation ;) is output templating and some other essential features such as loops. I do have an early prototype running in a local SPARQLBot version, but I probably won't have it online in time for tomorrow's Semantic Scripting Workshop (that I'll try to attend remotely at least). This is really powerful (and fun) stuff that will be available soon-ish. Can't wait to replace my hard-coded inferencer with a set of easily pluggable SPARQLScript procedures.

Other tweaks and changes include a very early hCalendar extractor and a couple of bug fixes that were reported by (among others) the SMOB project maintainers.

As usual, thanks to all who sent in patches, bug reports, feature requests, and stress-tested ARC. I think we're pretty close to a release candidate now :-)

