finally a bnode with a uri

Posts tagged with: semweb

bnode.org upgrade to SemSol

Switching to SemSol
If you can read this, my server move and blogging platform upgrade was successful.
Welcome to the new bnode, now powered by an entirely SPARQL-based CMS.

Flaws in iX article about "Semantic Web versus Web 2.0"

The title says it all.
The iX article Denny and I mentioned recently stirs up some discussion. Patrick Danowski asks for some details about the flaws in the article. I wanted to comment on his blog, but it turned into a whole post which I'll just copy below (bah, originally in german; translated to english here):

One certainly has to agree with any SemWeb-critical article insofar as marketing and the supply of information haven't been organized particularly well so far. Accordingly, it's understandable when someone only marginally involved, like Cai Ziegler, draws wrong conclusions. Hopefully the W3C's just-founded Semantic Web Education and Outreach interest group will change that.

The main problem is that Cai Ziegler doesn't seem to have fully understood the Semantic Web's approach (or simply enjoys flamewars). The goal is not to develop a "successor" to the Web. Rather, the Semantic Web initiative tries to specify technologies that make the contents of the existing Web easier to process further (in particular by making their semantics explicit). If the existing Web (let's call it "Web 2.0" for fun) produces information sources such as folksonomies or microformats, that has nothing to do with the SemWeb's core concern (explicit semantics). Rather, it enlarges the pool of data that SemWeb tools may be able to access later. A current example that nicely reduces the "versus" argument to absurdity is a draft by the GRDDL working group, in which the RDF community worked together with the microformats community on specifying a mechanism to transform microformats into RDF, so that they can then be integrated with the SPARQL query language. The Semantic Web vision (if one wants to speak of one at all) spans quite a few layers which are all based on "normal" Web techniques (IRIs, HTTP, etc.). Whether you want to bother with automated agents is something you can decide in 10 years; that depictions of the Semantic Web always show the complete layer stack is certainly not particularly clever. The actual "semantics" layers may not be trivial, but they are not much more complicated than programming an Atom store or a universal microformats parser.

A few concrete errors in the iX article:
  • Given the above, the statement "each in its entirely own way" is nonsensical. Either you are reasonable and don't try to compare "machine-interpretable data on the Web" with "the user at the center", or you carry out the comparison at the technical level and recognize that there is no contradiction.
  • Throughout the article, the author tries to construct a conflict of interests ("succession feud", "upper hand", etc.), while at the same time backpedaling and claiming that the approaches could benefit from each other. So which is it?
  • RDF's ontology spectrum (and RDF is not even synonymous with the Semantic Web) currently comprises SKOS, RDFS, and OWL. SKOS can represent folksonomies, RDFS hierarchies, and OWL relatively complex models. Cai Ziegler constructs a taxonomy-versus-folksonomy argument (and, on top of that, incorrectly defines taxonomies as pure "is-a" models) and concludes that taxonomies don't work, folksonomies do, but folksonomies are not SemWeb. He also claims that an ontology has to comprehensively define and encapsulate a domain. But that is precisely what is not necessary with Web ontologies (SKOS, RDFS, or OWL).
  • Queso (an RDF-based Atom store), my own experience combining Microsoft's LiveClipboard with SPARQL, RDF-based Web CMSs, and also the combination of microformats+eRDF+SPARQL make a lot of sense in my opinion and show quite some potential. Unfortunately, Cai Ziegler hasn't noticed these more recent developments, but as I said, he can't really be blamed for that; we SemWebbers have to subtly improve our marketing.
  • "The big breakthrough never came": Another politically colored statement. SPARQL, which is what makes the whole RDF world accessible to the average developer in the first place, together with SKOS, which picks up trends like folksonomies, are still right in the middle of the W3C process. eRDF and GRDDL for microformats are relatively new, too. The past tense is certainly not appropriate. Against the "they've been at it forever" argument, one can note with a smile that "Web 2.0" didn't emerge overnight either (as is often claimed). Only the name is still relatively young (though by now already 2 years old). Shortly before the dot-com doom, myBlaBlaBla portals were already the (supposedly) big hit (the user at the center); Amazon's "collective intelligence" has been around since 1999, eBay's long-tail exploitation and ratings since 1996. Blogs and wikis are ancient. In 1999, I myself worked for a startup that built something like Netvibes (the market leader back then was onepage.com). It always takes a while for technical developments to catch on. The call for more open data has only recently grown louder. So the "big breakthrough" probably couldn't even have happened yet.
  • "Weblogs are Web 2.0". Correct, and they use structured formats for syndication. Another example of the absurdity of the "versus" debate.
  • Semantic extensions for Wikipedia are described as "none in the implementation phase yet", which somehow suggests that the whole thing doesn't (or didn't) work. But all of this is still brand-new, too, and a nice example of how SemWeb approaches can be integrated in many places with relatively little effort.
  • Tagging vs. RDF (in the del.icio.us context): see SKOS; even the "rel-tag" microformat is only a few lines of code away from RDF.
  • "Folksonomies stand in stark contrast to [...] the very foundations of the Semantic Web": Unfortunately, that is completely wrong. Whether or not I run statistical analyses over collected tags is independent of Semantic Web technologies. SKOS folksonomies would, however, enable e.g. merging selected tags across service boundaries (e.g. del.icio.us and flickr) (extending/complementing the existing Web, not replacing it!). And anyone who has ever talked to a heavy del.icio.us user will notice that better structuring options and portability of tags are at the very top of the wish list. Oops.
  • "Web 2.0 beats the Semantic Web on its own turf". DMOZ is given as the example here, and the incorrect taxonomy example is used as justification once again. Unfortunately, DMOZ is not really a SemWeb project that integrates distributed information, but a centralized directory (which merely uses an outdated RDF version as its export format). The counter-example is missing, too. If del.icio.us is meant: it exports its lists as RSS and uses special markup to make the tags in the feeds explicit. Wonderful input for a semantic Web.
  • "shakes the body of thought associated with the term Semantic Web to its very foundations". Apart from the odd wording, this rather reveals what Cai Ziegler associates with the Semantic Web, and unfortunately articles like this one lead to even less informed people adopting those associations.
</rant>

ZGDV Talk: Semantic Web and Web 2.0

Talk at ZGDV Darmstadt about Semantic Web and Web 2.0
pipe dream vs. piece of jargon
There is a lot of Web 2.0 media buzz at the moment, many people seem to feel a presence [of enthusiasm] they haven't felt since... well, Obi-Wan Dot Com, I guess.

However, there also seems to be a misconception about Web 2.0 (whatever that term may mean to you) "replacing" the Semantic Web effort, or that - as written in an article in the current iX issue - the Semantic Web "was a failure", and "lost against" Web 2.0.

Yesterday, I gave a talk (slides, mostly in german, I'm afraid) at a ZGDV Conference in Darmstadt and tried to demystify this SemWeb "versus" Web 2.0 perception a little bit. I tried to show that the concepts are not that easy to compare really, that the technology behind them actually follows common goals, and that the whole discussion shouldn't be taken too seriously. Of course there is a mind share (and developer market) contest, but that's more or less all it boils down to when you analyse the "competition". See for example the rather childish "we are lowercase semantic web" claim of microformats. They are cool, pragmatic, and completely in line with the Semantic Web idea ("semantics for structured data on the web"). Hopefully we'll soon see some apps that demonstrate how the whole market could gain a lot if people worked together more actively (the GRDDL activity is a great example) instead of wasting their energy on politics (IMOSHO).

The talk itself went fine (I think), though it got too speedy towards the end as I ran out of time (as usual), which is surely where I lost a few people. But feedback was positive (as opposed to last webmonday, where I introduced the idea behind paggr and felt like Marty McFly after his guitar solo in BTTF ;).

Minority Report starring Leo Sauermann
Leo blogged, too, including funny photos of me (in hacker camouflage). I took some of him in return (see below). He gave an entertaining talk - on Semantic Desktops as you might've guessed - and started the whole thing with a "personal user interfaces in hollywood movies" quiz game, successfully waking up everyone in the room with mozartkugeln as incentive.
Leo presents Nepomuk

CMS dev communities starting to take stock in RDF

DrupalCon Brussels Report
A spontaneous invitation to DrupalCon got me driving to Brussels yesterday to finally meet the CivicActions folks I've been working for during the last months. Unfortunately, I missed Jonathan Hendler's NINA presentation about adding ARC's SPARQL API to Drupal for building a faceted browser, but we chatted quite a bit about it after lunch. I still have to learn a lot about Drupal, but one of the really interesting things is that it provides an extension called Content Construction Kit (CCK) that simplifies defining flexible forms and their elements. Drupal generates an HTML page for every resource ("node" in Drupal-speak) created via CCK. The thing that's missing is mapping the structured CCK nodes to RDF to enable optimized SPARQL querying while keeping editing simple and integrated. We discussed the potential of not only ex- but also importing RDF data into CCK. And how cool it could be to directly convert RDFS/OWL to CCK field definitions. Good news is that there are several hooks to RDF-enhance Drupal without running into synchronization issues or forcing the replacement of built-in components.

CivicActions was a gold sponsor and Dan Robinson introduced me to some of the core Drupal developers. And as it turned out, some of them are already thinking about direct RDF support for Drupal (partly triggered by TimBL using Drupal for blogging, partly because Drupal's internal structure isn't really far away from a graph-based model). I'm aware of three efforts now to add RDF to Drupal in some way, there may be more.

But it's not only the Drupal crowd which is looking at SemWeb technology. At lunch, I met Johan Janssens, lead developer of the Mambo spin-off Joomla!, who told me about a SemWeb project proposal for their 2006 Google Summer of Code. (There is another one in the ideas section.) The project took more than just this summer (welcome to RDF development ;), and the outcome is not going to be added to Joomla! anytime soon, but obviously the PHP community is getting aware of RDF's potential benefits and is starting to play with RDF, OWL, and SPARQL. And it's approaching the SemWeb from a practical point of view which just can't be bad.

Return of the Challenge

Semantic Web Challenge 2006
Just a reminder: The call for the SemWeb Challenge 2006 ends this week (Friday, 14th). If you are working on an RDF app, consider participating. It's a lot of fun, you'll get incredibly useful feedback, and the organizers clearly deserve loads of submissions for running this community event!

Unfortunately, I'm not going to participate this year as I have to focus on coding during the next months in order to push my SPARQL CMS thingy to version 1.0. And maybe I should stop listening to StarWars tunes while I'm working on layout stuff..

SemSol logo sample

Web Clipboard: Adding liveliness to "Live Clipboard" with eRDF, JSON, and SPARQL.

Combining Live Clipboard with eRDF and SPARQL
Some context: In 2004, Tim Berners-Lee mentioned a potential RDF Clipboard as a user model which allowed copying resource descriptions between applications. Depending on the type of the copied resource, the target app would trigger appropriate actions. (See also the ESW wiki and Danny's blog for related links and discussion.)

I had a go at an "RDF data cart" last year which allowed you to "1click"-shop resource descriptions while surfing a site. Before leaving, you could "check out" the collected resource descriptions. However, the functionality was limited to a single session, and the resource pointers didn't use globally valid identifiers.

Then, a couple of months ago, Ray Ozzie announced Live Clipboard, which uses a neat trick to access the operating system's clipboard for Copy & Paste operations across web pages.

Last week, I finally found the time to combine the Live Clipboard trick with the stuff I'm currently working on: A Semantic Publishing Framework, Embeddable RDF, and SPARQL. If you haven't heard of the latter two: eRDF is a microformats-like way to embed RDF triples in HTML, SPARQL is the W3C's protocol and query language for RDF repositories.

What I came up with so far is a Web Clipboard that works similarly to Live Clipboard (I'm actually thinking about making it fully compatible), with just a few differences:

  • Web Clipboard uses a hidden single-line text input instead of a textarea, which seemed a little easier to insert into the document structure, and which makes it work in Opera 8.5. The downside is that input fields don't allow multi-line content to be pasted (not needed by Web Clipboard, but necessary if I want to add Live Clipboard compatibility).
  • Web Clipboard doesn't paste complete resource descriptions, but only pointers to those. This makes it possible to e.g. copy a resource from a simple list of persons' names, and display full contact details after a paste operation. (See the demo for an example which does asynchronous calls to a SPARQL endpoint.) This "pass by reference" enables things like distributed address books or calendars where changes in one place could automatically be reflected in the other apps.
  • Instead of XML, Web Clipboard uses a small JSON object which can simply be evaluated by JavaScript applications, or split with a basic regular expression. The pasted object contains 1) a resource identifier, and 2) an endpoint where information about the identified resource is available. The endpoint information consists of a URL and a list of specifications supported by the endpoint.

Complete documentation is going to be up at the clipboard site, but I'll first see if I can make things Live Clipboard-compatible (and I'll be travelling for the rest of the week). Here is a simple explanation of how the current SPARQL demo works:

Apart from adding a small javascript library and a CSS file to the page, I specified the clipboard namespace and a default endpoint to be used for any resource pointer embedded in the page (this is eRDF syntax):
<link rel="schema.webclip" href="http://webclip.web-semantics.org/ns/webclip#" />
<link rel="webclip.endpoint" href="http://www.sparqlets.org/clipboard/sparql" />

Then I embedded a sparqlet that generates the list of Planet RDF bloggers (this is done server-side). The important thing is that the HTML contains eRDF hooks like this:
<div id="agent0" class="-webclip-Res">
  <span class="webclip-resID" title="_:bb1ed0e67fdb042619f2f20fdc479c3af_id2245787"></span>
  <span class="foaf-name">Bob DuCharme</span>
  <a rel="foaf-weblog" href="http://www.snee.com/bobdc.blog/">bobdc.blog by Bob DuCharme</a>
</div>
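
For illustration, these are roughly the triples an eRDF parser would extract from the hook above (this is my reading of the eRDF rules; prefixes abbreviated):
<#agent0> rdf:type webclip:Res ;
          webclip:resID "_:bb1ed0e67fdb042619f2f20fdc479c3af_id2245787" ;
          foaf:name "Bob DuCharme" ;
          foaf:weblog <http://www.snee.com/bobdc.blog/> .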

Ideally, the resource ID (webclip:resID, here again in eRDF notation) is a URI or some other stable identifier. The queried endpoint, however, obviously couldn't find a URI for the rendered resource, so it only provided a bnode ID. This is ok for the SPARQL endpoint the clipboard uses, though. The "foaf:weblog" information could be used to further disambiguate the resource identifier, the demo doesn't use it, however.

(The nice thing about eRDF-encoded hooks is that the information can be read by any HTTP- and eRDF-enabled client; the clipboard functionality could be implemented without having to load the page in a browser.)

Now, when the page is displayed, an onload-handler instantiates a JavaScript Web Clipboard which automatically adds an icon for each resource identified by the "webclip:Res/webclip:resID" hooks.

When the icon is clicked, the resource pointer JSON object is created and can be copied to the system's clipboard. It currently looks like this (on a single line):
{
 resID: "_:bb1ed0e67fdb042619f2f20fdc479c3af_id2245787",
 endpoint: {
  url: "http://www.sparqlets.org/clipboard/sparql",
  specs: [
   "http://www.w3.org/TR/rdf-sparql-protocol/",
   "http://bob.pythonmac.org/archives/2005/12/05/remote-json-jsonp/"
  ]
 }
}

We can see that the clipboard uses the default endpoint mentioned at the document level as the embedded hook didn't specify a resource-specific endpoint. We can also see that the endpoint supports two specs, namely the SPARQL protocol and JSONP.

When this JSON object is pasted to another clipboard section, the onpaste-handler can decide what to do. In the demo, any paste section will make an asynchronous On-Demand-JavaScript call to the resource's SPARQL endpoint to retrieve a custom resource representation. The "Latest blog post" section uses a pre-defined callback, but this can be overwritten (as e.g. done by the "Resource Description" section which uses a custom function to display results).
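
On the server side, supporting these On-Demand-JavaScript calls mostly means wrapping the SPARQL JSON result in the callback function name sent by the client. Here is a minimal PHP sketch of such a responder (illustrative only, not the actual sparqlets.org code; run_sparql_query is a hypothetical helper):

<?php
/* minimal JSONP-style responder sketch (illustrative):
   wrap a SPARQL JSON result in the callback name sent by the client */
$callback = isset($_GET['callback']) ? $_GET['callback'] : 'handleResult';
$callback = preg_replace('/[^a-zA-Z0-9_\.]/', '', $callback); /* sanitize */

$json = run_sparql_query($_GET['query']); /* hypothetical query helper */

header('Content-Type: application/x-javascript');
echo $callback . '(' . $json . ');'; /* executed by the injected script tag */
?>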

I've added a playground area to the clipboard site where you can create your own clipboard sections. Give it a try, it's not too complicated. You can even bookmark them.

Here is an example JavaScript snippet that adds a clipboard section to a clipboard-enabled page with an 'id="resultCountSection"' HTML element:
window.clipboard.addSection({
  id: "resultCountSection",
  resIDVar: "myRes",
  query: "SELECT ?knowee WHERE " +
         "{ ?myRes <http://xmlns.com/foaf/0.1/knows> ?knowee . } " +
         "LIMIT 50",
  callback: function(qr) {
    var rows = (qr.results["bindings"]) ? qr.results.bindings : [];
    var result = "The pasted resource seems to know " + rows.length + " persons.";
    /* update paste area */
    this.item.innerHTML = result;
    /* refresh clipboard */
    window.clipboard.activate();
  }
});
window.clipboard.activate();

Something like this is all that will be needed for the final clipboard. No microformats parsing or similar burdens (although you could use the Web Clipboard to process microformats). The Clipboard's definition of an endpoint is rather open, too. An RSS file could be considered an endpoint as well as any other Web-accessible document or API.

ARC Embedded RDF (eRDF) Parser for PHP

Announcing eRDF support for ARC + an eRDF/RDFa comparison
Update: The current RDFa primer is *not* broken wrt WebArch; the examples were fixed two weeks ago. I've also removed the "no developer support" rant, just received personal support ;-)

While searching for a suitable output format for a new RDF framework, I've been looking at the various semantic hypertext approaches, namely microformats, Structured Blogging, RDFa, and Embedded RDF (eRDF). Each one has its pros and cons:

Microformats:
  • (+) widest deployment so far
  • (+) integrate nicely with current HTML and CSS
  • (-) centralized project, inventing custom microformats is discouraged
  • (-) don't scale: either the number of MFs will stay very limited, or sooner or later there will be class name collisions

Structured Blogging:
  • (+) a large number of supporters (at least potentially, the supporters list is huge, although this doesn't represent the available tools)
  • (+) not a competitor, but a superset of microformats
  • (-) the metadata is embedded in a rather odd way
  • (-) the metadata is repeated
  • (-) the use cases are limited (e.g. reviews, events, etc)

RDFa:
  • (+) follows certain microformats principles (e.g. "Don't repeat yourself")
  • (+) freely extensible
  • (+) All resource descriptions (e.g. for events, profiles, products, etc.) can be extracted with a single transformation script
  • (+) RDF-focused
  • (+) W3C-supported
  • (-) Not XHTML 1.0 compliant; it will take some time before it can be used in commercial products or picky geek circles
  • (-) The default datatype of literals is rdf:XMLLiteral which is wrong for most deployed properties

eRDF:
  • (+) follows the microformats principles
  • (+) freely extensible
  • (+) All resource descriptions (e.g. for events, profiles, products, etc.) can be extracted with a single transformation script
  • (+) uses existing markup
  • (+) XHTML 1.0 compliant
  • (+) RDF-focused
  • (-) Covers only a subset of RDF
  • (-) Does not support XML literals

So, both RDFa and eRDF seem like good candidates for embedding resource descriptions in HTML. The two are not really compatible, though; it is not easily possible to create a superset which is both RDFa and eRDF. However, my publishing framework is using a Wiki-like markup language (M4SH) which is converted to HTML, so I can add support for both approaches and make the output a configuration option. Maybe it's even possible to create a merged serialization without confusing transformers.

I'll surely have another look at RDFa when there is better deployment potential. For now, I've created a M4SH-to-eRDF converter (which is going to be available as part of the forthcoming SemSol framework), and an eRDF parser that can generate RDF/XML from embedded RDF. I've also added some extensions to work around (plain) eRDF's limitations, the main one being on-the-fly rewriting of owl:sameAs assertions to allow full descriptions of remote resources, e.g.
<div id="arc">
  <a rel="owl-sameAs" href="http://example.com/r/001#001"></a>
  <a rel="doap-maintainer" href="#ben">Benjamin</a>
</div>
is automatically converted to
<http://example.com/r/001#001> doap:maintainer <#ben>

The parser can be downloaded at the ARC site (documentation).
I've also put up a little demo service if you want to test the parser.
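
Using the parser should then boil down to a few lines of PHP. A hypothetical sketch (the class, method, and file names here are made up for illustration; see the documentation for the actual API):

<?php
/* hypothetical usage sketch - see the ARC documentation for the real API */
include_once('ARC_erdf_parser.php'); /* illustrative file name */

$parser = new ARC_erdf_parser();
$parser->parse('http://example.com/page-with-erdf.html');
echo $parser->toRDFXML(); /* generate RDF/XML from the embedded triples */
?>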

ARC API and Website

With the API release ARC gets its own Website.
ARC logo
As of today, ARC has its own website: arc.web-semantics.org. It comes with a new release. The site content itself is written in a Wiki-like markup language (M4SH, pronounced "mash") which can be converted to HTML or RDF/XML. The HTML is going to be either eRDF- or RDFa-enhanced, I haven't decided yet. At the moment it simply produces classical HTML.

Slides from SemWeb Workshop in Hamburg

Talking about SemWeb and Web 2.0 at Edeka
serious now ;-)
I ran a SemWeb Workshop in Hamburg this week. Just a small one, less than 10 participants, but it was at EDEKA, Germany's largest food retailer, and I didn't really know what to expect. Luckily, everything went well. Friendly folks and lots of interesting questions/discussions (although they suggested renaming SPARQL, given that SPAR is an EDEKA brand now ;). It still feels a bit strange to talk about RDF and related stuff in german; my slides had some weird english-german messed-up translations. (As usual, I started creating them too late and had the silly idea to "quickly" build my own html slideshow thingy.)

Not sure if it's interesting for anyone reading this blog (it's all in german) but I've uploaded the slides (HTML, navigation via arrow keys or footer bar, TOC via "t", bugs for free). It's not really much, I mainly used them to not drift off too far, and to test Opera's kiosk mode, which is really cool.

ARC RDF Store for PHP - enSPARQL your LAMP

ARC RDF Store release
A first version of ARC RDF Store is now available. It's written entirely in PHP and has been optimized for basic LAMP environments where install and system configuration privileges are often not available. As with the other ARC components, I tried to keep things modular for easier integration in other PHP/MySQL-based systems.

A full store installation consists of just 7 files (

JSONC, JSONI, JSONP: Creating tailored JSON from SPARQL query results

Returning SPARQL results as optimized JSON string
Update (2006-02-05): Elias Torres sent me a pointer to a draft of a related tech note he is working on with other DAWG members. I've adjusted my serialiser, so that its output is closer to their proposal now. After testing the different serialisation options, I've also updated the JSONI format. The examples below show the changes already.

My current sparqlet (SPARQL-driven portlet) implementations mostly use queries generated server-side; the results are returned as application-specific JavaScript. While this approach allows certain bandwidth and convenience optimisations, it always needs custom code on the server.

For generic operations, SPARQL endpoints offer an XML format which can be consumed by in-browser applications via XHR techniques. However, the SPARQL result XML format makes things quite bloated, and what I learned from Web 2.0 coders is that JSON results are often preferred.

It's rather straightforward to generate JSON code equivalent to the XML structure. I know that several people are working on this, but I couldn't find any public versions. I hope my ARC stuff isn't too different, but I can easily tweak it later. Here is a sample of a default SPARQL JSON result returned from an ARC server:
SELECT DISTINCT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 20
{
 head: {
  vars: ["s", "p", "o"]
 },
 results: {
  distinct: true,
  ordered: false,
  compact: false,
  indexed: false,
  bindings: [
   {
    s: {
     type: "bnode",
     value: "b2490b2520bf2872200093194ff36f465_id2245308"
    },
    p: {
     type: "uri",
     value: "http://xmlns.com/foaf/0.1/weblog"
    },
    o: {
     type: "uri",
     value: "http://www.nzlinux.org.nz/blogs/"
    }
   },
   {
    s: {
     type: "uri",
     value: "http://www.nzlinux.org.nz/blogs/"
    },
    p: {
     type: "uri",
     value: "http://www.w3.org/2000/01/rdf-schema#seeAlso"
    },
    o: {
     type: "uri",
     value: "http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9"
    }
   },
    ...
   ]
 }
}
(I'm using associative arrays for the bindings in order to reduce bandwidth a bit. I didn't put too much work into this default serialisation; it's probably going to change when there is a recommended format available.)

However, like the XML result, this JSON alternative is not the most efficient when a consuming app doesn't need the typing info of the individual bindings (uri/bnode/literal). I played around with some pre-defined "compact" JSON formats, but looking at the queries I'm using in my stuff, there are often cases where I want the typing info for one or two of the bindings, but not for the rest. The solution I implemented for the ARC RDF Store looks like this: the user can specify an optional jsonc argument which defines whether a binding should be serialised entirely or whether it can be flattened:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p1 ?p1_name ?p2_name ?p2_mbox_sha1
WHERE {
 ?p1 foaf:name ?p1_name ;
     foaf:knows ?p2 .
 ?p2 foaf:name ?p2_name ;
     foaf:mbox_sha1sum ?p2_mbox_sha1 .
}
ORDER BY ?p1_name
LIMIT 30
jsonc="p1(),p1_name,p2_name,p2_mbox_sha1";
{
 head: {
  variables: ["p1", "p1_name", "p2_name", "p2_mbox_sha1"]
 },
 results: {
  distinct: true,
  ordered: true,
  compact: true,
  indexed: false,
  bindings: [
   {
    p1: {
     type: "bnode",
     value: "b2e9ddd5ebb264646b852dcd207e13d8a_bn1"
    },
    p1_name: "Jim Ley",
    p2_name: "Jeremiah McElroy",
    p2_mbox_sha1: "f0d988b33153f21479cffa647cbe6faac65a98f8"
   },
   {
    p1: {
     type: "bnode",
     value: "b2e9ddd5ebb264646b852dcd207e13d8a_bn1"
    },
    p1_name: "Jim Ley",
    p2_name: "Mart Sanderson",
    p2_mbox_sha1: "ce3165ecf98cdb6d8153503949b320e24a6138a0"
   },
    ...
   ]
 }
}

Appending parentheses to a result variable activates the complete serialisation; the other vars will be flattened. The jsonc parameter can also be used to remove selected result variables from the returned JSON. This may be helpful in cases where they were needed to retrieve the SPARQL result set (e.g. in combination with DISTINCT) but aren't actually used in the client app.

JSONC can help reduce bandwidth and browser memory consumption, but it doesn't really add much to the front-end developer's convenience. The RDF model is graph-based and resource-oriented, but SPARQL results are tabular, usually with a lot of repeated values. Therefore a developer has to process the code before resource-centric views can be displayed. What's missing (if we want to avoid custom, server-side code or heavy pre-processing on the client) is a way to tell the SPARQL endpoint to arrange and index the tabular results before they are serialised as JSON: JSONI. The jsoni parameter works similarly to the jsonc one, but it allows nesting of result vars to specify index structures:
jsoni="p1_name(p2_name)";
{
 head: {
  variables: ["p1_name"]
 },
 results: {
  distinct: true,
  ordered: true,
  compact: false,
  indexed: true,
  index: {
   p1_name: [
    {
     value: "Jim Ley",
     type: "literal",
     index: {
      knows: [
       "Jeremiah McElroy",
       "Mart Sanderson"
      ]
     }
    },
    {
     value: "Leandro Mariano L

Cologne's 2nd Web Montag : "SemWeb and Web 2.0"

Gave a semweb talk at webmontag.
I gave a little talk on "SemWeb and Web 2.0" at yesterday's Web Montag in Cologne. A very nice (and lounge-y) event (BarCamp-like) that I'm surely going to attend again. It's an hour by train away from Essen, so still close enough.
web montag lounge, cologne
Met very interesting and (unexpectedly) also very interested folks. I only had a SemWeb-in-one-slide presentation, but it turned into a nice interactive discussion very quickly, with lots of smart questions, and a 2-hour Q&A afterwards. And we even managed to discuss the possible application of RDF technologies to Web 2.0 software ("mashup-chaining" and the like).

Gartner's Research VP Alexander Linden talked about SemWeb opportunities and problems from an analyst perspective which was very interesting as well.
Gartner Hype Cycle

Thinking about the other demos (the DOJO framework, and a promising group calendaring app called Reminderix), the atmosphere and feedback at Web Montag, and then having a closer look at Gartner's Hype Cycle, I really think it's about time for SemWeb folks to put more effort into getting the mainstream Web community involved (before the Web 2.0 hype reaches its peak).

And we need more apps and demos, no matter how simple. Imagine a feed aggregator where you could hover over a contributor's name and get a nice inline profile preview, with a list of the person's last 5 posts, or articles he/she commented on. That sort of stuff. Mentioning Oracle and Adobe raises attention, but as long as there are no convincing and cool Web(!) apps out there, frontend-oriented developers are not likely to invest the time in learning RDF (we know it's not as hard as the critics say, but it's also not as simple as we often claim). We should have high hopes for SPARQL!

A small list of impressions (and notes for myself) related to my talk and the reactions:
  • people understand the triple (subject-predicate-object) idea and the possibilities of such a generic model, esp. when introduced via known vocabs such as dublin core
  • you lose 'em somewhere between RDFS and OWL (showing a class tree with some annotations can help)
  • they are back when you give a simple SPARQL example
  • "how do I connect my data(base) to the semantic web?"
  • "how do I find the data?"
  • "cool. now, where do I start?"
  • "what does (program) code look like?" (view-source?)
  • "where is the connection to my HTML pages?" (link between the clickable web and the semantic web)
  • "known apps?"
  • "known apps?"
  • "known apps?"
  • "how does such a system handle redundancy and missing information"
  • "which tools would I need, can I work on the lower layers only?"

written in a hurry, may add more points later. The basic thing is that these have been very practical questions, so this is a shout-out (including myself) to provide more hands-on guides and webby demos.

</bla>

Web Monday / Web Montag, 2006-01-30 in Cologne, Germany

Web Montag in Cologne.
The Web 2.0 buzz brings back some nice habits from dot-com times: Web developer get-togethers. I'll have a go at next week's Web Montag (in Cologne): "Web Monday - connects users, developers, founders, entrepreneurs, researchers, web pioneers, bloggers, podcasters, designers and other folks interested in Web 2.0 topics (in the broadest sense)." We don't have a beamer yet, and folks may shy away from the event, now that I've put "SemWeb" on the agenda ;)

We'll see ...

ARC SPARQL2SQL Rewriter for PHP v0.2.0

Updated SPARQL2SQL rewriter for ARC
OK, I could easily spend another month on this beast, but it should be good enough for my current projects and I really have to continue with those now. The v0.2.0 rewriter converts a structure created by the ARC SPARQL Parser to SQL code. This allows pushing SPARQL query processing to a mySQL database engine, thus working around the PHP performance bottleneck on hosted Web servers (well, not only there ;). It supports only a subset of the specification, but I tried to cover the most common cases. The rewriter doesn't make much sense as a stand-alone component (unless you are an RDF infrastructure developer), but I'll keep its revisions separate from the upcoming ARC RDF Store.

Unfortunately, the W3C test cases are provided in n3 only, but I managed to at least scrape the examples from the working draft. As you may be able to see in that document, the rewriter cannot handle multiple/nested UNIONs, combined expressions, some of the built-ins (e.g. lang, langMatches), custom functions, and several other features yet, but it can convert triple patterns, OPTIONALs (simple, grouped, or nested), simple UNIONs, simple REGEXes (translated to LIKEs where possible), GRAPH queries, and dataset restrictions (although I'm still not 100% sure if I completely understood the FROM NAMED stuff). I also included the optimisation stuff I wrote about last week: A list of property alternatives can be provided which will then be rewritten to embedded ORs instead of using UNIONs. And the rewriter is able to create SQL for a split up triple table space (How to split is customisable, I'm going to write more about this when the store is released).
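
To give an idea of the general rewriting approach (a simplified sketch, not the rewriter's actual output; the single triple table and readable predicate values are illustrative, the real store works with hashed term IDs), a basic graph pattern with two triple patterns becomes a self-join:

-- SPARQL: SELECT ?person ?name
--         WHERE { ?person foaf:name ?name . ?person foaf:weblog ?blog . }
SELECT T0.s AS person, T0.o AS name
FROM triples T0
JOIN triples T1 ON (T1.s = T0.s)
WHERE T0.p = 'foaf:name'
  AND T1.p = 'foaf:weblog';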

ARC Store design fine-tuning

Getting close to releasing ARC RDF Store
I still haven't released ARC Store as I'm continually discovering optimisation possibilities while working on the SPARQL2SQL rewriter. The latter is coming along quite well; I just ticked off simple UNIONs, numeric comparisons, grouped OPTIONALs, and nested (yay!) OPTIONALs on my to-do list. I'm currently struggling a little bit with GRAPH queries combined with datasets (restricted via FROM / FROM NAMED), but that's more due to spec reading incapabilities than to mySQL's interpretation of SQL. I'm going to implement a subset of the SPARQL built-ins (e.g. REGEX), but after that the store should finally be usable enough for a first public release.
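
For the OPTIONAL patterns, the rough idea (again a simplified sketch with a readable triple table, not ARC's actual SQL) is to map each OPTIONAL group to a LEFT JOIN, so that unmatched patterns show up as NULLs instead of removing the row:

-- SPARQL: SELECT ?s ?name ?mbox
--         WHERE { ?s foaf:name ?name . OPTIONAL { ?s foaf:mbox ?mbox } }
SELECT T0.s, T0.o AS name, T1.o AS mbox
FROM triples T0
LEFT JOIN triples T1 ON (T1.s = T0.s AND T1.p = 'foaf:mbox')
WHERE T0.p = 'foaf:name';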

However, I'm not that used to relational algebra and there are lots of mySQL-specific options, so I frequently consulted the manual to find out how to e.g. construct SQL UNIONs and LEFT JOINs, or how to make things easier for the query optimizer. I already wrote about RDF store design considerations last month, but it looks like there's more room for optimisation:

Shorter index keys

I'm still using CHAR columns for the hashes, but instead of using the hex-based md5 of an RDF term, I'm now converting the md5 to a shorter string (without losing information). The CHAR column uses a full byte for each character, but the characters in an md5 string are all from [0-9a-f] (i.e. a rather small 16-character set). Taking the md5 hash as a base-16 number, I can easily shorten it by using a longer character set. As I said before, PHP can't handle large integers, so I split the md5 string into three chunks, converted each part to an integer, and then re-encoded the result with a different, larger set of characters. I first used
'0123456789 abcdefghijklmnopqrstuvwxyz!?()+,-.@;=[]_{}'
(54 characters) which reduced the overall column size to 23 (-28%). Then I found out that BINARY table columns do case-sensitive matching and may even be faster, so I could change the set to
'0123456789 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}'
(79 chars).

The column size is now 21 (66% of the initial md5). Taking only a sub-portion of the md5 hash (as e.g. done by 3store) could improve things further. This may all sound a little bit desperate (that's at least what mySQL folks said), but as the ARC Store is probably going to be the only SPARQL engine optimised for basic shared web hosting environments, I assume it's worth the extra effort. Note that overall storage space is not (yet) my main concern, it's the size of the indexes used for join operations.
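
Here is a rough sketch of the re-encoding step (not ARC's actual code: the chunking is simplified to 7 hex characters so that plain integer arithmetic is safe even on 32-bit PHP, which is why this version yields 25 rather than 21 characters):

<?php
/* sketch: re-encode a hex md5 with a larger character set */
$charset = '0123456789 abcdefghijklmnopqrstuvwxyz'
         . 'ABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}'; /* 79 chars */

function shorten_hash($md5, $charset) {
  $base = strlen($charset);
  $result = '';
  foreach (str_split($md5, 7) as $chunk) { /* 7 hex chars = 28 bits */
    $n = hexdec($chunk);
    $enc = '';
    do {
      $enc = $charset[$n % $base] . $enc;
      $n = (int) floor($n / $base);
    } while ($n > 0);
    /* pad to a fixed width: 5 base-79 chars cover any 28-bit chunk */
    $result .= str_pad($enc, 5, $charset[0], STR_PAD_LEFT);
  }
  return $result; /* 5 chunks x 5 chars = 25-char key */
}

echo shorten_hash(md5('http://example.com/res'), $charset);
?>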

OR instead of UNION

SPARQL UNIONs can't always be translated to SQL ORs (at least I couldn't figure out how), so using SQL's UNION construct is the better way to be compliant. However, for most practical use cases for UNIONs (alternative predicates), a simple WHERE (p='rdfs:label' OR p='foaf:name' OR ...) is much faster than a union. I don't know how to efficiently automate the detection of when to rewrite to ORs; I'll probably have to make that API-only.
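
As a sketch of the difference (table and column names made up for illustration):

-- compliant translation of { ?s rdfs:label ?o } UNION { ?s foaf:name ?o }:
SELECT T0.s, T0.o FROM triples T0 WHERE T0.p = 'rdfs:label'
UNION
SELECT T0.s, T0.o FROM triples T0 WHERE T0.p = 'foaf:name';

-- much faster rewrite for the alternative-predicates case:
SELECT T0.s, T0.o FROM triples T0
WHERE (T0.p = 'rdfs:label' OR T0.p = 'foaf:name');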

Splitting up the table space

I think TAP and Jena offer ways to separate selected statements from the main triple table, thus accelerating certain joins and queries (and saving storage space). I also read about this strategy in a more recent blog post by Chimezie Ogbuji who describes an approach with a dedicated rdf:type table.

The problem with a generic solution is a) to decide how to split up the triples, and b) how to efficiently run queries over the whole set of split tables (e.g. for <foo> ?p ?o patterns).

re a): A table for rdf:type is a reasonable idea; 25% of the statements in the datasets I've worked with so far were rdf:type statements, with another 10% used by dc:date and foaf:name, but the numbers for FOAF and DC terms are clearly application-specific. In order to speed up joins, it might also be useful to completely separate object-to-object relations from those relating resources to literals (e.g. in CONFOTO, the latter consume over 40%).

re b): From the little experience I gained so far, I don't expect UNIONs or JOIN/OR combinations to be sufficiently fast. But mySQL has a MERGE storage engine which is "a collection of identical MyISAM tables that can be used as one". This allows efficient queries on selected tables (e.g. for joins, or rdf:type restrictions) and ok-ish performance when the whole set of tables has to be included in a query.
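
A sketch of how that could look (simplified columns, no indexes shown; an actual store would use the hashed term IDs discussed above):

-- split triple tables plus a MERGE table spanning all of them
CREATE TABLE t_type     (s CHAR(21), p CHAR(21), o CHAR(21)) ENGINE=MyISAM;
CREATE TABLE t_literals (s CHAR(21), p CHAR(21), o CHAR(21)) ENGINE=MyISAM;
CREATE TABLE t_rest     (s CHAR(21), p CHAR(21), o CHAR(21)) ENGINE=MyISAM;

-- queries that can't be restricted to one split table (e.g. <foo> ?p ?o)
-- go against the MERGE table:
CREATE TABLE t_all (s CHAR(21), p CHAR(21), o CHAR(21))
ENGINE=MERGE UNION=(t_type, t_literals, t_rest) INSERT_METHOD=LAST;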

I'm still experimenting, may well be that I only go for the first optimisation in ARC store v0.1.0, but the other ones are surely worth further considerations.
