Posts tagged with: arc

Microdata, semantic markup for both RDFers and non-RDFers

RDF-in-HTML could have been so simple.
There's been a whole lot of discussion around Microdata, a new approach for embedding machine-readable information into forthcoming HTML5. What I find most attractive about Microdata is the fact that it was designed by HTMLers, not RDFers. It's refreshingly pragmatic, free of other RDF spec legacy, but still capable of expressing most of RDF.

Unfortunately, RDFa lobbyists on the HTML WG mailing list forced the spec out of HTML5 core for the time being. This manoeuvre was understandable (a lot of energy went into RDFa, after all), but in my opinion very short-sighted. How many uphill battles did we have, trying to get RDF to the broader developer community? And how many were successful? Atom, microformats, OpenID, Portable Contacts, XRDS, Activity Streams (well, not really), these are examples where RDFers tried but failed to promote some of their infrastructure into the respective solutions. Now: HTML5, where the initial RDF lobbying actually had an effect and led to a native mechanism for RDF-in-HTML. Yes, native, not in some separate spec. This would have become part of every HTML5 book, any HTML developer on this planet would have learned about it. Finally a battle won. And what a great one. HTML.

But no, Microdata wasn't developed by an RDF group, so they voted it out again. Now, the really sad thing is, there could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers. The RDFa group recently realized that RDFa needs to be revised anyway; there is going to be an RDFa 1.1, which will require new parsers. If they'd swallowed their pride, they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata.

Here is a short overview of RDF features supported by Microdata:
  • Explicit resource containers, via @itemscope (in RDFa, the boundaries of a resource are often implicitly defined by @rel or @typeof)
  • Subject declaration, via @itemid (RDFa uses @about)
  • Main subject typing, via @itemtype (RDFa uses @typeof)
  • Predicate declaration, via @itemprop (RDFa uses @property, @rel, and @rev)
  • Literal objects, via node values (RDFa also allows hidden values via @content)
  • Non-literal objects, via @href, @src, etc. (RDFa also allows hidden values via @resource)
  • Object language, via @lang
  • Blank nodes
I won't go into details why hiding semantics in RDFa will be penalized by search engines as soon as spammers discover the possibilities, why reusing RDF/XML's attribute names was probably not a smart move with regard to attracting non-RDFers, why the new @vocab idea is impractical, or why namespace prefixes, as handy as they are in other RDF formats, are not too helpful in an HTML context. Let's simply state that there is a trade-off between extended features (RDFa) and simplicity (Microdata). So, what are the core features that an RDFer would really need beyond Microdata:
  • the possibility to preserve markup, but probably not necessarily as an explicit rdf:XMLLiteral
  • datatypes for literal objects (I personally never used them in practice in the last 6 years that I've been developing RDF apps, but I can see some use cases)
Markup preservation is currently turned on by default in RDFa and can be disabled through @datatype in RDFa, so an RDFer-satisfying RDFa 1.1 spec could probably just be Microdata + @datatype + a few extended parsing rules to end up with the intended RDF. My experience with watching RDF spec creation tells me that the RDFa group won't pick this route (there simply is no "Kill a Feature" mentality in the RDF community), but hey, hope dies last.

I've been using Microdata in two of my recent RDF apps and the CMS module of (ahem, still not documented) Trice, and it's been a great experience. ARC is going to get a "microRDF" extractor that supports the RDF-in-Microdata markup below (Note: this output still requires a 2nd extraction process, as the current Microdata draft's RDF mechanism only produces intermediate RDF triples, which then still have to be post-processed. I hope my related suggestion will become official, but I seem to be the only pro-Microdata RDFer on the HTML list right now, so it may just stay as a convention):

Microdata:
<div itemscope itemtype="http://xmlns.com/foaf/0.1/Person">

  <!-- plain props are mapped to the itemtype's context -->
  <img itemprop="img" src="mypic.jpg" alt="a pic of me" />
  My name is <span itemprop="name"><span itemprop="nick">Alec</span> Tronnick</span>
  and I blog at <a itemprop="weblog" href="http://alec-tronni.ck/">alec-tronni.ck</a>.

  <!-- other RDF vocabs can be used via full itemprop URIs -->
  <span itemprop="http://purl.org/vocab/bio/0.1/olb">
    I'm a crash test dummy for semantic HTML.
  </span>
</div>
Extracted RDF:
@base <http://host/path/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
_:bn1 a foaf:Person ;
      foaf:img <mypic.jpg> ;
      foaf:name "Alec Tronnick" ;
      foaf:nick "Alec" ;
      foaf:weblog <http://alec-tronni.ck/> ;
      bio:olb "I'm a crash test dummy for semantic HTML." .
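
To make the mapping above a bit more tangible, here is a rough Python sketch of the convention (plain props resolved against the itemtype's namespace, full-URI props used as-is). This is not ARC's "microRDF" extractor; the class and function names are made up for illustration. It only handles a single, non-nested item, ignores @lang, and does not distinguish URI objects from literals in its output tuples.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
URL_ATTRS = {"a": "href", "link": "href", "img": "src"}  # object taken from a URL attribute
VOID_TAGS = {"img", "link", "meta", "br", "hr", "input"}  # elements without an end tag


class MicroRdfSketch(HTMLParser):
    """Hypothetical single-item Microdata-to-triples extractor (illustration only)."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.subject = None
        self.namespace = ""
        self.open_elements = []  # one entry per open element: None or a property record
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.subject is None:
            # @itemid declares the subject; otherwise fall back to a blank node
            self.subject = attrs.get("itemid") or "_:bn1"
            itemtype = attrs.get("itemtype") or ""
            if itemtype:
                cut = max(itemtype.rfind("/"), itemtype.rfind("#")) + 1
                self.namespace = itemtype[:cut]
                self.triples.append((self.subject, RDF_TYPE, itemtype))
        record = None
        props = (attrs.get("itemprop") or "").split()
        if props and self.subject is not None:
            # full URIs are kept, plain names are mapped to the itemtype's namespace
            predicates = [p if ":" in p else self.namespace + p for p in props]
            if tag in URL_ATTRS:
                target = urljoin(self.base, attrs.get(URL_ATTRS[tag]) or "")
                self.triples.extend((self.subject, p, target) for p in predicates)
            else:
                record = {"predicates": predicates, "text": []}
        if tag not in VOID_TAGS:
            self.open_elements.append(record)

    def handle_data(self, data):
        # text belongs to every open property element, so nested props work
        for record in self.open_elements:
            if record is not None:
                record["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_elements:
            return
        record = self.open_elements.pop()
        if record is not None:
            literal = " ".join("".join(record["text"]).split())
            self.triples.extend((self.subject, p, literal) for p in record["predicates"])


def extract(html, base):
    parser = MicroRdfSketch(base)
    parser.feed(html)
    parser.close()
    return parser.triples


SAMPLE = """
<div itemscope itemtype="http://xmlns.com/foaf/0.1/Person">
  <img itemprop="img" src="mypic.jpg" alt="a pic of me" />
  My name is <span itemprop="name"><span itemprop="nick">Alec</span> Tronnick</span>
  and I blog at <a itemprop="weblog" href="http://alec-tronni.ck/">alec-tronni.ck</a>.
  <span itemprop="http://purl.org/vocab/bio/0.1/olb">
    I'm a crash test dummy for semantic HTML.
  </span>
</div>
"""

for triple in extract(SAMPLE, "http://host/path/"):
    print(triple)
```

The sketch relies on well-formed markup (every non-void element is closed); a real extractor would need a proper DOM and the post-processing step mentioned above.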

Code.semsol.org - A central home for semsol code

Semsol gets code repositories and browsers
The code bundles on the ARC website are generated in an inefficient manual process, and each patch has to wait for the next to-be-generated zip file. The developer community is growing (there are now 600 ARC downloads each month), I'm increasingly receiving patches and requests for a proper repository, and the Trice framework is about to get online as well. So I spent last week on building a dedicated source code site for all semsol projects at code.semsol.org.

So far, it's not much more than a directory browser with source preview and a little method navigator. But it will simplify code sharing and frequent updates for me, and hopefully also for ARC and Trice developers. You can check out various Bazaar code branches and generate a bundle from any directory. The app can't display repository messages yet (the server doesn't have bzr installed, I'm just deploying branches using the handy FTP option), but I'll try to come up with a work-around or an alternative when time permits.

Code Browser

Back from New York "Semantic Web for PHP Developers" trip

Gave a talk and a workshop in NYC about SemWeb technologies for PHP developers
I'm back from New York, where I was given the great opportunity to talk about two of my favorite topics: Semantic Web Development with PHP, and (not necessarily semantic) Software Development using RDF Technology. I was especially looking forward to the second one, as that perspective is not only easier to understand for people from a software engineering context, but also because it is still a much neglected marketing "back-door": If RDF simplifies working with data in general (and it does), then we should not limit its use to semantic web apps. Broader data distribution and integration may naturally follow in a second or third step once people use the technology (so much for my contribution to Michael Hausenblas' list of RDF MalBest Practices ;)

The talk on Thursday at the NY Semantic Web Meetup was great fun. But the most impressive part of the event was the people there. A lot to learn from on this side of the pond. Not only very practical and professional, but also extremely positive and open. Almost felt like being invited to a family party.

The positive attitude was even true for the workshop, which I clearly could have made more effective. I didn't expect (but should have) that many people would come w/o a LAMP stack on their laptops, so we lost a lot of time setting up MAMP/LAMP/WAMP before we started hacking ARC, Trice, and SPARQL.

Marco brought up a number of illustrative use cases. He maintains an (unofficial, sorry, can't provide a pointer) RDF wrapper for any group on meetup.com, so the workshop participants could directly work with real data. We explored overlaps between different Meetup groups, the order in which people joined selected groups, inferred new triples from combined datasets via CONSTRUCT, and played with not-yet-standard SPARQL features like COUNT and LOAD.

And having done the workshop should finally give me the last kick to launch the Trice site now. The code is out, and it's apparently not too tricky to get started even when the documentation is still incomplete. Unfortunately, I have a strict "no more non-profits" directive, but I think Trice, despite being FOSS, will help me get some paid projects, so I'll squeeze an official launch in sometime soon-ish.

Below are the slides from the meetup. I added some screenshots, but they are probably still a bit boring without the actual demos (I think a video will be put up in a couple of days, though).

ARC Graph Gear Serializer Plugin

Patrick Murray-John created an ARC2 converter for Graph Gear visualizations
Patrick Murray-John (who is currently Semantifying the University of Mary Washington) just released a first version of an ARC2 converter for Graph Gear visualizations. Looks pretty cool.
Graph Gear visualization from RDF via ARC

ARC now also GPL-licensed

ARC is now available under the W3C Software or the GPL license
Arto Bendiken and Stéphane Corlosquet asked me to provide ARC also under the GPL (for Drupal, in addition to the current W3C Software License), so here you are.

ARC is already used by several modules that help turn Drupal into an RDF-powered CMS, for example the RDF API, the SPARQL extension, or the Calais module. The new license will make it easier for the Drupal community to directly bundle ARC with their RDF extensions. I guess that Drupal will have its own complete RDF toolkit one day, but it's great to see ARC being utilized for accelerating the development progress.

New ARC version (DB changes!)

ARC revision 2009-03-04 introduces some low-level changes.
I've just uploaded a new ARC revision (rev 2009-03-04). In preparation for a Store Optimizer (which will improve the RDF store's scalability and performance), I slightly changed the underlying MySQL table structure. Please back up your data before you upgrade. Here's sample code, it's not too complicated:
/* old ARC version */
$store->createBackup($backup_path); // make sure the directory is write-enabled
// on success: $store->drop();
/* new ARC */
$store->query('LOAD <' . $backup_path . '>');
(Note: Store settings, if used, are not part of the dump and have to be manually copied over)

Main changes in this version:
  • the (crappy) inferencer was removed, I'll add a rule-based system at some later stage
  • the store indexes can be defined via a "store_indexes" config option now. Default: array('sp (s,p)', 'os (o,s)', 'po (p,o)'). The previous stores had an additional index 'spo (s,p,o)'.
  • The store got an "extendColumns" method which changes the column types from MEDIUMINT to INT. You don't have to call this method explicitly, the tables will be auto-upgraded should your store reach ARC's previous 16M triples limit.
  • There is a new "store_write_buffer" config option (default: 2500, changed from 5000). This option lets you set the batch size of triples written to the MySQL tables. In certain situations, esp. with shared hosts or large literal objects, the "5000" was too much and led to MySQL rejecting the queries.
  • the toTurtle and toRDFXML methods (and associated methods in the Serializers) accept a 2nd "raw" parameter now, in case you don't want a full RDF document, but just the triples. (thx to Claudia Wagner for the suggestion)

If you have questions, just send them to the mailing list and I'll try to help.

RPointer - The resource described by this XPointer element

URIs for resources described in microformatted or poshRDF'd content
I'm often using/parsing/supporting a combination of different in-HTML annotations. I started with eRDF and microformats, more recently RDFa and poshRDF. Converting HTML to RDF usually leads to a large number of bnodes or local identifiers (RDFa is an exception. It allows the explicit specification of a triple's subject via an "about" attribute). Additionally, multi-step parsing a document (e.g. for microformats and then for eRDF) will produce different identifiers for the same objects.

I've searched for a way to create more stable, URI-based IDs. Mainly for two use cases: Technically, for improved RDF extraction, and practically for being able to subscribe to certain resource fragments in HTML pages, like the main hCard on a person's Twitter profile. The latter is something I need for Knowee.

The closest I could find (and thanks to Leigh Dodds for pointing me at the relevant specs) is the XPointer Framework and its XPointer element() scheme, which is defined as: ...intended to be used with the XPointer Framework to allow basic addressing of XML elements.
Here is an example XPointer element and the associated URI for my Twitter hCard:
element(side/1/2/1)
http://twitter.com/bengee#element(side/1/2/1)
We can't, however, use this URI to refer to me as a person (unless I redefine myself as an HTML section ;-). It would work in this particular case as I could treat the hCard as a piece of document, and not as a person. But in most situations (for example events, places, or organizations), we may want to separate resources from their respective representations on the web (and RDFers can be very strict in this regard). This effectively means that we can't use element(), but given the established specification, something similar should work.

So, instead of element(), I tweaked ARC to generate resource() URIs from XPointers. In essence:
The RPointer resource() scheme allows basic addressing of resources described in XML elements. The hCard mentioned above as RPointer:
resource(side/1/2/1)
http://twitter.com/bengee#resource(side/1/2/1)
There is still a certain level of ambiguity as we could argue about the exact resource being described. Also, as HTML templates change, RPointers are only as stable as their context. But practically, they work quite fine for me so far.
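
The rewrite itself is trivial. A minimal Python sketch, assuming the pointer is the URI's whole fragment (the function name is made up, this is not ARC's code):

```python
import re

# A trailing XPointer element() fragment, e.g. "#element(side/1/2/1)"
_ELEMENT_FRAGMENT = re.compile(r"#element\(([^()]+)\)$")


def to_rpointer(uri):
    """Turn an element() pointer URI into the corresponding resource() URI."""
    new_uri, count = _ELEMENT_FRAGMENT.subn(r"#resource(\1)", uri)
    if count == 0:
        raise ValueError("no element() pointer in %r" % uri)
    return new_uri


print(to_rpointer("http://twitter.com/bengee#element(side/1/2/1)"))
# prints http://twitter.com/bengee#resource(side/1/2/1)
```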

Note: The XPointer spec provides an extension mechanism, but it would have led to very long URIs including a namespace definition for each pointer. Introducing the non-namespace-qualified resource() scheme unfortunately means dropping out of the XPointer Framework ("This specification reserves all unqualified scheme names for definition in additional XPointer schemes"), so I had to give it a new name (hence "RPointer") and have to hope that the W3C doesn't create a resource() scheme for the XPointer framework.

RPointers are implemented in ARC's poshRDF and microformats extractors.

poshRDF - RDF extraction from microformats and ad-hoc markup

poshRDF is a new attempt to extract RDF from microformats and ad-hoc markup
I've been thinking about this since Semantic Camp where I had an inspiring dialogue with Keith Alexander about semantics in HTML. We were wondering about the feasibility of a true microformats superset, where existing microformats could be converted to RDF without the need to write a dedicated extractor for each format. This was also about the time when "scoping" and context issues around certain microformats started to be discussed (What happens for example with other people's XFN markup, aggregated in a widget on my homepage? Does it affect my social graph as seen by XFN crawlers? Can I reuse existing class names for new formats, or do we confuse parsers and authors then? Stuff like that).

A couple of days ago I finally wrote up this "poshRDF" idea on the ESW wiki and started with an implementation for paggr widgets, which are meant to expose machine-readable data from RDFa, microformats, but also from user-defined, ad-hoc formats, in an efficient way. PoshRDF can enable single-pass RDF extraction for a set of formats. Previously, my code had to walk through the DOM multiple times, once for each format.

A poshRDF parser is going to be part of one of the next ARC revisions. I've just put up a site at poshrdf.org to host the dynamic posh namespace. For now the site links to a possibly interesting by-product: A unified RDF/OWL schema for the most popular microformats: xfn, rel-tag, rel-bookmark, rel-nofollow, rel-directory, rel-license, hcard, hcalendar, hatom, hreview, xfolk, hresume, address, and geolocation. It's not 100% correct, poshRDF is after all still a generic mechanism and doesn't cover format-specific interpretations. But it might be interesting for implementors. The schema could be used to generate dedicated parser configurations. It also describes the typical context of class names so that you can work around scoping issues (e.g. the XFN relations are usually scoped to the document or embedded hAtom entries).

I hope to find some time to build a JSON exporter and microformats validator on top of poshRDF in the not too distant future. Got to move on for now, though. Dear Lazyweb, feel free to jump in ;)

Writing Inference Rules with SPARQLScript

SPARQLScript can be used for forward chaining, including string manipulations on the run.
In order to keep data structures in Semantic CrunchBase close to the source API, I used a 1-to-1 mapping between CrunchBase JSON keys and RDF terms (with only a few exceptions). This was helpful for people knowing the JSON API, but it wasn't easy to interlink the converted information with existing SemWeb data such as FOAF, or the various LOD sources.

SPARQLScript is already heavily used by the Pimp-My-API tool or the TwitterBot, but yesterday I added a couple of new features and finally had a go at implementing a (forward chaining) rule evaluator (for the reasons mentioned some time ago).

A first version ("LOD Linker") is installed on Semantic CB, with initially 9 rules (feel free to leave a comment here if you need some additional mappings). With SPARQLScript being a superset of SPARQL+, most inference scripts are not much more than a single INSERT + CONSTRUCT query (you can click on the form's "Show inference scripts" button to see the source code):
$ins_count = INSERT INTO <${target_g}>
  CONSTRUCT {?res a foaf:Organization } WHERE {
    { ?res a cb:Company }
    UNION { ?res a cb:FinancialOrganization }
    UNION { ?res a cb:ServiceProvider }
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res a foaf:Organization } }
    FILTER(!bound(?g))
  }
  LIMIT 2000
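
Read procedurally, the rule above does roughly the following. This is only an in-memory Python sketch over plain (s, p, o) tuples with prefixed names as strings; the function name is invented and no SPARQL store is involved:

```python
RDF_TYPE = "rdf:type"
ORGANIZATION = "foaf:Organization"
SOURCE_CLASSES = {"cb:Company", "cb:FinancialOrganization", "cb:ServiceProvider"}


def infer_organizations(triples, limit=2000):
    """Forward-chain: every cb:* organization-like resource is a foaf:Organization."""
    # resources that already carry the target type (the "prevent dupes" part)
    existing = {s for (s, p, o) in triples if p == RDF_TYPE and o == ORGANIZATION}
    inferred, seen = [], set()
    for s, p, o in triples:
        if p == RDF_TYPE and o in SOURCE_CLASSES and s not in existing and s not in seen:
            seen.add(s)
            inferred.append((s, RDF_TYPE, ORGANIZATION))
            if len(inferred) >= limit:  # mirrors the LIMIT clause
                break
    return inferred


data = [
    ("ex:a", "rdf:type", "cb:Company"),
    ("ex:b", "rdf:type", "cb:ServiceProvider"),
    ("ex:b", "rdf:type", "foaf:Organization"),  # already typed, will be skipped
    ("ex:a", "rdf:type", "cb:FinancialOrganization"),
]
print(infer_organizations(data))
# prints [('ex:a', 'rdf:type', 'foaf:Organization')]
```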
But with the latest SPARQLScript processor (ARC release 2008-09-12) you can run more sophisticated scripts, such as the one below, which infers DBPedia links from Wikipedia URLs:
$rows = SELECT ?res ?link WHERE {
    { ?res cb:web_presence ?link . }
    UNION { ?res cb:external_link ?link . }
    FILTER(REGEX(?link, "wikipedia.org/wiki"))
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res owl:sameAs ?v2 } . }
    FILTER(!bound(?g))
  }
  LIMIT 500

$triples = "";
FOR ($row in $rows) {
  # extract the wikipedia identifier
  $id = ${row.link.replace("/^.*\/([^\/\#]+)(\#.*)?$/", "\1")};
  # construct a dbpedia URI
  $res2 = "http://dbpedia.org/resource/${id}";
  # append to triples buffer
  $triples = "${triples} <${row.res}> owl:sameAs <${res2}> . "
}

#insert
if ($triples) {
  $ins_count = INSERT INTO <${target_g}> { ${triples} }
}

(I'm using a similar script to generate foaf:name triples by concatenating cb:first_name and cb:last_name.)
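
For those who want to try the string manipulation outside of SPARQLScript, the same regular expression works in Python. A small sketch (the function names are made up, and the triple is just written out as a string):

```python
import re

# Same pattern as in the script: last path segment, without an optional #fragment
_WIKI_ID = re.compile(r"^.*/([^/#]+)(#.*)?$")


def dbpedia_uri(wikipedia_url):
    """Return the DBpedia resource URI for a Wikipedia URL, or None if no ID is found."""
    match = _WIKI_ID.match(wikipedia_url)
    if match is None:
        return None
    return "http://dbpedia.org/resource/" + match.group(1)


def same_as_triple(resource_uri, wikipedia_url):
    target = dbpedia_uri(wikipedia_url)
    if target is None:
        return ""
    return "<%s> owl:sameAs <%s> ." % (resource_uri, target)


print(dbpedia_uri("http://en.wikipedia.org/wiki/Twitter#History"))
# prints http://dbpedia.org/resource/Twitter
```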

Inferred triples are added to a graph directly associated with the script. Apart from a destructive rule that removes all email addresses, the reasoning can easily be undone again by running a single DELETE query against the inferred graph.

I'm quite happy with the functionality so far. What's still missing is a way to rewrite bnodes, I don't think that's already possible. But INSERT + CONSTRUCT will leave bnode IDs unchanged, so the inference scripts don't necessarily require URI-denoted resources.

Another cool aspect of SPARQLScript-based inferencing is the possibility to use a federated set of endpoints, each processing only a part of a rule. The initial DBPedia mapper above, for example, uses locally available wikipedia links. However, CrunchBase only provides very few of those. So I created a second script which can retrieve DBPedia identifiers for local company homepages, using a combination of local queries and remote ones against the DBPedia SPARQL endpoint (in small iterations and only for companies with at least one employee, but it works).

A Faceted Browser for ARC

One of the first Trice components is probably going to be a faceted browser for ARC
I'm going on vacation in a couple of days, and before that, I'm trying to tick off at least a few of the bigger items on my ToDo list. I was hoping for a first Trice preview (now that ARC is slowly getting stable), but this will have to wait until September. However, I managed to get another component that's been on my list for ages into a demo-able state today: A SPARQL/ARC-based faceted browser (test installation at Semantic CrunchBase).

faceted browser

It's an early, but working (I think ;) version. A template mechanism for the item previews is still missing, but I'm already quite happy with the facet column. The facets are auto-generated (based on statistical info and scope-detection), but it's also possible to define custom filters (for more complicated graph patterns, see screenshot below). Once again, SPARQLScript simplified development, thanks to its placeholders for parameterized queries.

faceted browser administration

I think I'm going to use the browser for a first Trice bundle. It's not too sophisticated, but builds on several core features such as request dispatching, RDF/SPARQL-based views and forms, basic AJAX calls, and cached template sections.

ARC Triples Visualizer Plugin

Luis Paulo implemented a graphviz access plugin for ARC
Luis Paulo created a graphviz plugin for ARC that generates .dot files and also SVG or bitmap graphics from triple sets (available features depend on the graphviz libraries installed on your machine). Thanks, Luis, great stuff!

ARC TriplesVisualizer Plugin output

Semantic Web by Example: Semantic CrunchBase

CrunchBase is now available as Linked Data including a SPARQL endpoint and a custom API builder based on SPARQLScript.
Update: Wow, these guys are quick, there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served from crunchbase.com directly (maybe using the nicely evolving Rena toolkit), but the SemWeb community has a reputation of scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits, so, at this stage (and given the nice API), it might make more sense to start with a separate site, and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).

An important principle in RDF land is the distinction between a resource and a page about a resource (it's very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for e.g. Twitter and the Twitter description. There are different approaches, I decided to use (fake-)hash URIs which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company,
  • GETing the identifier resolves to /company/twitter which describes the company.
  • Direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via Content Negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
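
The URI pattern from the list above boils down to a few string operations. A small Python sketch, assuming the site lives at cb.semsol.org (the host of the SPARQL endpoint) and using invented helper names:

```python
from urllib.parse import urldefrag


def document_uri(resource_uri):
    """The page describing a resource: the hash URI without its fragment."""
    return urldefrag(resource_uri).url


def rdf_uri(resource_uri):
    """The direct RDF representation: the document URI plus ".rdf"."""
    return document_uri(resource_uri) + ".rdf"


twitter = "http://cb.semsol.org/company/twitter#self"  # denotes the company
print(document_uri(twitter))  # prints http://cb.semsol.org/company/twitter
print(rdf_uri(twitter))       # prints http://cb.semsol.org/company/twitter.rdf
```

An HTTP client never sends the fragment to the server anyway, which is why GETing the identifier ends up at the describing document.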

The instant benefit of having linked data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors to their organizations to their relations etc.). However, the CrunchBase team has already done a great job, their UI already supports this functionality quite nicely, the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not obvious: An RDF-powered app can be extended at any time. On the data-level. Without the need for model changes (because there is none specified). And without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard Jobs, DBPedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders", or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Those can now only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends. Setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information. Here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
LIMIT 20
(Query result as HTML)
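
Sending such a query from your own code is just an HTTP GET with a "query" parameter, as defined by the SPARQL protocol. The Python sketch below only builds the request URL and does not fetch anything. It assumes the endpoint at cb.semsol.org/sparql and that the endpoint already knows the cb: prefix (otherwise a PREFIX declaration has to be added to the query).

```python
from urllib.parse import urlencode

ENDPOINT = "http://cb.semsol.org/sparql"  # assumed endpoint location

RECENT_ACQUISITIONS = """
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .
    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
LIMIT 20
""".strip()


def query_url(endpoint, query):
    """Build a SPARQL protocol GET URL; the query travels in the "query" parameter."""
    return endpoint + "?" + urlencode({"query": query})


print(query_url(ENDPOINT, RECENT_ACQUISITIONS))
```

The resulting URL can be handed to urllib.request.urlopen or any other HTTP client; the available result formats depend on the endpoint.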

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}
(Results)

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far, it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies, investments, and has structured location data. CrunchBoard on the other hand has job descriptions, but only a single field for City and State, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS' limitations/practices, the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <http://www.crunchboard.com>
ENDPOINT <http://cb.semsol.org/sparql>
# refresh feed
if (${GET.refresh}) {
 # replaced <http://feeds.feedburner.com/CrunchboardJobs> with full feed
 LOAD <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)
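
If the pattern join looks like magic: it is the same thing as matching two lists on a shared value, here the company name. A plain Python sketch with made-up sample records (the field names and example.org links are invented for illustration and are not the actual CrunchBoard or CrunchBase data):

```python
def matching_jobs(jobs, companies, state="CA", round_code="a"):
    """Join job offers and companies on the company name, then filter the companies."""
    eligible = {
        company["name"]: company["link"]
        for company in companies
        if state in company["office_states"] and round_code in company["round_codes"]
    }
    return [
        {
            "job_title": job["title"],
            "job_link": job["link"],
            "comp_name": job["company"],
            "comp_link": eligible[job["company"]],
        }
        for job in jobs
        if job["company"] in eligible
    ]


# invented sample data
jobs = [
    {"title": "PHP Developer", "link": "http://example.org/jobs/1", "company": "Acme"},
    {"title": "Designer", "link": "http://example.org/jobs/2", "company": "Globex"},
]
companies = [
    {"name": "Acme", "link": "http://example.org/company/acme",
     "office_states": {"CA"}, "round_codes": {"a"}},
    {"name": "Globex", "link": "http://example.org/company/globex",
     "office_states": {"NY"}, "round_codes": {"a", "b"}},
]

for match in matching_jobs(jobs, companies):
    print(match["job_title"], "at", match["comp_name"])
# prints PHP Developer at Acme
```

The difference is that SPARQL does this declaratively and across graphs, without any of the data having to be reshaped first.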

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls
  • add a template mechanism for human-friendly results

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post. Here are some example API calls that were already created with it:
A lot of fun, more to come.

SPARQLScript - Semantic Mashups made easy

SPARQLScript gets loops and output templating and can now be used to build simple semantic mashups.
What is a scripting language without loops, or a Web language without a template mechanism? Not really usable. Yesterday, I finally added the two missing core features to my SPARQLScript processor, and I'm excited about eventually being able to test the whole thing. This is just the beginning (there is no string concatenation yet, and no WHILE blocks), but with the basic infrastructure (and documentation) in place, it's time to start gathering feedback. I'm going to upgrade SPARQLBot in the next couple of days which should be a fun way to explore the possibilities (also, it was the bot's users who triggered the creation of SPARQLScript in the first place).

So, what is it actually good for?

Mid-term-ish, I'm dreaming of an alternative to increasingly non-RDFy specs such as RIF and OWL2 (there is definitely some need for them, they just don't seem to really work for me and my Web stuff). Things like crawling, smushing, or custom inference tasks based on wild mixtures of RDFS, OWL, and SKOS should be doable with SPARQLScript.

Simple agents are another use case, as SPARQLScript simplifies task federation across multiple endpoints and RDF sources.

What's already working today is the creation of simple mashups and widgets. Below is a script that integrates status notices from my Twitter and identi.ca feeds, and then creates an HTML "lifestream" snippet. The (live!) result is embedded at the bottom of this post.
# global prefix declarations
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rss: <http://purl.org/rss/1.0/>

# the target store
ENDPOINT <http://arc.semsol.org/demos/endpoint/>

# refresh feeds every 30 minutes
$up2date = ASK FROM <script-infos> WHERE {
  <script-infos> dc:date ?date . FILTER (?date > "${NOW-30min}")
}
IF (!$up2date) {
  # load feeds
  LOAD <http://twitter.com/statuses/user_timeline/9516642.rss>
  LOAD <http://identi.ca/bengee/rss>
  # remember the update time
  INSERT INTO <script-infos> { <script-infos> dc:date "${NOW}" }
}

# retrieve items
$items = SELECT * WHERE {
  ?item a rss:item ;
        rss:title ?title ;
        dc:date ?date .
} ORDER BY DESC(?date) LIMIT 8;

# output template
"""<h4>My online lifestream:</h4>
<ul>"""
FOR ($item in $items) {
  """<li><a href="${item.item}">${item.title}</a></li>"""
}
"</ul>"

(S)mashups here we come :)
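
For readers who don't speak SPARQLScript yet, the two non-query steps of the lifestream script above (the 30-minute freshness check and the output template) can be sketched in plain Python. This is only an illustration of the logic, not how the SPARQLScript processor works internally; the function names are made up, the rows are assumed to be dicts with `item`, `title`, and `date` keys, and the HTML escaping is an addition the script itself does not show.

```python
from datetime import datetime, timedelta
from html import escape


def needs_refresh(last_update, now, max_age=timedelta(minutes=30)):
    """True when the feeds were never loaded or the last load is too old."""
    return last_update is None or now - last_update >= max_age


def render_lifestream(items, limit=8):
    """Render result rows as an HTML list, newest first.

    Assumes ISO 8601 date strings in one time zone, which sort
    correctly as plain strings.
    """
    newest = sorted(items, key=lambda row: row["date"], reverse=True)[:limit]
    lines = ["<h4>My online lifestream:</h4>", "<ul>"]
    for row in newest:
        lines.append('<li><a href="%s">%s</a></li>' % (
            escape(row["item"], quote=True),  # escaping is an addition here
            escape(row["title"]),
        ))
    lines.append("</ul>")
    return "\n".join(lines)
```

The point of the comparison is mostly how little glue remains once the store handles loading, querying, and sorting.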

An RDF Parser for Google's Social Graph API JSON

ARC gets a parser for JSON returned by Google's SG API
First, the usual credits to Morten (and also Dan), who already suggested extracting RDF from Google's SG API results some time ago.

Some work will be needed for a complete mapping of the detailed information coming out of the API. Not only because the data is not always fully accurate (the API still thinks that Ian Davis and I are the same person) but also because the claims are document-oriented while most SG-related RDF vocabs are person-centric.

However, for any given URL somehow associated with a person, the API returns a set of identifiers that are very likely to lead to related data. So, for an RDF toolkit, these pointers are often already sufficient to send out its RDF extractors and enrich the local dataset. The SG API parser that has now been added to ARC (revision 2008-07-15) is still pretty basic, but it generates rdfs:seeAlso triples with the canonical_mapping's value as subject and every mentioned HTTP identifier as object.
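
The mapping described above can be sketched in a few lines of Python. This is not ARC's PHP parser, just an illustration of the idea; it assumes that `canonical_mapping` is a JSON object whose values are the canonical URLs, and it simply treats every string starting with http:// or https:// anywhere in the response as a "mentioned HTTP identifier". The function names are hypothetical.

```python
import json

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def http_identifiers(value):
    """Yield every http(s) string in a decoded JSON value, object keys included."""
    if isinstance(value, str):
        if value.startswith(("http://", "https://")):
            yield value
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from http_identifiers(key)
            yield from http_identifiers(child)
    elif isinstance(value, list):
        for child in value:
            yield from http_identifiers(child)


def see_also_triples(response_text):
    """Return sorted (subject, rdfs:seeAlso, object) triples for one response."""
    data = json.loads(response_text)
    # Assumption: canonical_mapping maps the queried URL to its canonical form.
    subjects = set(data.get("canonical_mapping", {}).values())
    triples = set()
    for subject in subjects:
        for obj in http_identifiers(data):
            if obj != subject:  # no self-referencing seeAlso
                triples.add((subject, RDFS_SEE_ALSO, obj))
    return sorted(triples)
```

A real mapping would look at the individual claim types instead of flattening everything, which is exactly the work that is still missing.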

I'm working on more low-level/direct RDF mappings for POSH formats such as XFN. Those could simplify detailed triple extraction from the API results (without too much of the current person --homepage-> document indirection).

Using the new parser in ARC is identical to working with any other syntax. The format detector will auto-include the necessary components. Just call
$parser->parse("http://socialgraph.apis.google.com/lookup?q=example.com&...")
or
$store->query("LOAD <http://socialgraph.apis.google.com/lookup?q=example.com&...>")

SPO(G) in ARC

Streaming backups via the SPOG SPARQL result format in ARC
The Urban Dictionary describes SPOG as "super pimped out gangsta" or as "a weapon that (...) had a fusion reactor as a power source". Sorry to disappoint you, neither has become part of ARC. Nevertheless, the SPOG I mean is quite powerful, too. It is a constrained SPARQL XML result format from SELECT queries that was proposed by Morten Frederiksen a few months ago. SPOG enables streaming store backups/dumps, and being another RDF serialization, it can be used for streamed loading as well. Support for SPOG was added in the latest revision (2008-07-02) and extends the store and the endpoint components:
  • The store got a dump() method that stream-outputs SPOG from all quads, and a createBackup($path, $alternative_query) method to write a SPOG dump (or custom SPO(G) query result) to a local file
  • The SPARQL endpoint feature list accepts "dump" as a new read operation
  • The SPARQL endpoint accepts "DUMP" as a query type now ("DUMP" also works via the internal query() method)
  • The format detector accepts SPOG XML as an RDF format now, so SPARQL+ queries will work fine with LOAD <some-spog-file.srx>. (There is now a dedicated SPOG parser for streaming LOADs.)
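
To show why SPOG lends itself to streaming, here is a small Python sketch that reads a SPARQL XML result row by row and turns each row into a quad. It is not ARC's SPOG parser; it assumes the result uses the variable names s, p, o, and g, and the `iter_quads` name and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format
SRX = "{http://www.w3.org/2005/sparql-results#}"


def iter_quads(source):
    """Stream (s, p, o, g) tuples from a SPARQL XML result, one row at a time.

    Each term is a (kind, value) pair, kind being 'uri', 'literal' or 'bnode'.
    Assumes the variables are named s, p, o and g.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != SRX + "result":
            continue
        row = {}
        for binding in elem.findall(SRX + "binding"):
            term = binding[0]  # <uri>, <literal> or <bnode>
            row[binding.get("name")] = (term.tag[len(SRX):], term.text or "")
        yield row.get("s"), row.get("p"), row.get("o"), row.get("g")
        elem.clear()  # drop the processed row to keep memory use low


SAMPLE = b"""<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/><variable name="p"/>
    <variable name="o"/><variable name="g"/>
  </head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/a</uri></binding>
      <binding name="p"><uri>http://purl.org/dc/elements/1.1/title</uri></binding>
      <binding name="o"><literal>Hello</literal></binding>
      <binding name="g"><uri>http://example.org/graph</uri></binding>
    </result>
  </results>
</sparql>"""

quads = list(iter_quads(io.BytesIO(SAMPLE)))
```

Because every `<result>` element is a complete quad, a consumer never has to hold more than one row in memory, which is what makes the format usable for large dumps.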

These additions should simplify graph exchange and store replication quite a bit.

Morten++ for the idea and an initial implementation.

Documentation - Release Notes
