
ARC SPARQL Parser and SQL Rewriter for PHP

Adding SPARQL functionality to ARC.
A query interface is one of the last things I need to code for my SemWeb platform. I still haven't looked closely at the SPARQL protocol which will hopefully bring us standardized HTTP bindings, but what I've done during the last days was writing a basic SPARQL query parser and an experimental SQL rewriter to run queries against an RDF store.


The things I was most interested in were sorting, and maybe optionals. My browser was already using LIMIT and OFFSET, but the query function to retrieve triples from the store doesn't allow constraints beyond resource type and keywords. And the browser's result list is actually generated in two steps: first, I retrieve matching resource identifiers, and then, in a second step, I read label information etc. from the store for each resource in the result set. This makes sorting almost impossible. What I wanted was a way to retrieve arbitrary property values without having to do lots of database requests. Being able to not hard-code the retrieval function is of course another nice thing about a spec'd query language.
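To illustrate, this is roughly the kind of query I'd like to answer with a single database round trip (the prefix and properties are just an example, not from my store):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name ?blog WHERE {
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:weblog ?blog }
}
ORDER BY ?name
LIMIT 10 OFFSET 20
```

Sorting by ?name while keeping ?blog optional is exactly what the two-step retrieval approach can't do.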

The first version (source) of the parser is bigger than expected (45KB), partly due to several redundant but separately written regular expressions, but also because I'm not very used to walking through grammars. I skipped quite a few features for the moment, and the Turtle parser is nothing to be proud of. However, the script seems to work OK for standard queries. The result of the parsing process is a structured array with all the elements and information found in the query. The parser also checks for (potential) errors (but doesn't stop parsing), which can be checked before e.g. trying to generate SQL from the returned result:
$q = get_http_arg("query_code", "post");
$parser = new ARC_sparql_parser();
$infos = $parser->parse($q); /* method name may differ in the released version */
if ($errors = $parser->errors) { /* or however the error list is exposed */
  echo "Errors (or unsupported commands) in query:<br />";
  echo "- " . join("<br />- ", $errors);
}
The $infos variable is an associative array containing the following entries:
  • base: The query base, if set via BASE
  • prefixes: An array of prefix2uri mappings. Relative URIs are already adjusted if a base is available.
  • result_form: A container with information about the result type ("select", "construct", etc.) and result-type-specific details (e.g. distinct, result_vars, return_all, etc.). I haven't spent much time with the non-SELECT result types, some of them may not be parsed correctly.
  • vars: An array of all the query variables found (without leading '?'s and '$'s).
  • graph_pattern: The query's "root" graph pattern with an elements entry containing a list of sub-pattern elements (triples, optional_graph_patterns, union_graph_patterns, ...) which themselves may contain further elements or a triples array in case of a triples element.
  • offset, if available
  • limit, if available
  • order_conditions, if available
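For a simple query like SELECT ?name WHERE { ?p foaf:name ?name } LIMIT 10, the returned structure looks roughly like this (a hand-written print_r-style sketch; the exact nesting inside graph_pattern may differ):

```
Array
(
    [prefixes] => Array ( [foaf] => http://xmlns.com/foaf/0.1/ )
    [result_form] => Array ( [type] => select, [result_vars] => Array ( [0] => name ) )
    [vars] => Array ( [0] => p, [1] => name )
    [graph_pattern] => Array ( [elements] => Array ( [0] => Array ( [type] => triples, [triples] => Array ( ... ) ) ) )
    [limit] => 10
)
```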

SPARQL to SQL Rewriter

Although the parser now allowed me to access the different parts of a SPARQL query, I couldn't really figure out how to get from there to SQL. I had a basic idea of how to do the optionals, but I was a little bit clueless about how to generate the alias mappings etc. Thanks to #swig residents Libby, Morten, and Eric, who pointed me at algorithms and Matt Biddulph's excellent converter write-up, I could implement a basic SQL rewriter. The first version supports:
  • SELECTs (the other query types are not supported yet)
  • specified variable bindings or bindings for any variable (SELECT *)
  • DISTINCT results
  • restricting the query to a background graph (FROM)
  • multiple triple patterns (I haven't tried UNION and nested graph patterns yet, but they could perhaps work already)
  • OPTIONAL triple patterns

The script can't handle constraints (FILTER foo) or collections ((foo bar baz)) yet.
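A query that exercises the currently supported bits (DISTINCT, a background graph via FROM, multiple triple patterns, an OPTIONAL) would look something like this; the graph URI and properties are made up for the example:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?blog
FROM <http://example.com/people.rdf>
WHERE {
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:weblog ?blog }
}
```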
The rewriter is adapted to the (not yet available) ARC store, which is one of those simple MySQL-based RDF stores. It consists of two tables: an rdf_source table for sources, and a de-normalized triple table (omitting indexes here as they are not that well thought through anyway):
CREATE TABLE [prefix]_rdf_source (
  rdf_source_id int(10) unsigned NOT NULL default '1',
  url varchar(255) default NULL,
  rdfxml_hash varchar(255) default NULL,
  modified_uts bigint(18) unsigned NOT NULL default '0'
);

CREATE TABLE [prefix]_triple (
  triple_id bigint(20) unsigned NOT NULL default '1',
  rdf_source_id bigint(20) unsigned default '0',
  s varchar(255) default NULL,
  s_merged varchar(255) default NULL,
  p varchar(255) default NULL,
  o text,
  o_merged varchar(255) default NULL,
  s_is_bnode enum('t','f') NOT NULL default 'f',
  o_is_bnode enum('t','f') NOT NULL default 'f',
  o_is_literal enum('t','f') NOT NULL default 'f',
  o_is_long_literal enum('t','f') NOT NULL default 'f',
  o_lang varchar(7) default NULL,
  o_dt varchar(255) default NULL,
  o_short_data varchar(255) default NULL,
  o_hash varchar(255) default NULL,
  is_inferred enum('t','f') NOT NULL default 'f',
  added_uts bigint(18) unsigned default '0',
  modified_uts bigint(18) unsigned default '0'
);

In order to correctly map the SQL to an actual ARC store, the rewriter has to be instantiated with a table prefix (the argument key below is just a guess at the interface):
$args = array("table_prefix" => "arc");
$rewriter = new ARC_sparql_sql_rewriter($args);
Another store-specific thing is the use of the s_merged and o_merged columns. Those are needed for the (un-)smusher I mentioned a couple of days ago. In order to catch both literals and resources, I have to use a CONCAT(Tx.o_merged, Tx.o_short_data), which may make things slower due to missing indexes.
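In the generated SQL, that trick looks roughly like this (table prefix and property URI are example values; for a resource object the value comes from o_merged, for a short literal from o_short_data):

```sql
SELECT CONCAT(T1.o_merged, T1.o_short_data) AS name
FROM arc_triple T1
WHERE T1.p = 'http://xmlns.com/foaf/0.1/name'
```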

Finally, I'm URL-encoding database values, which is an old relic from times when phpMyAdmin couldn't display quotation marks and tags in table fields.

But apart from that, the schema doesn't make the rewriting algorithm too complicated (my summary probably is, though):
  1. Check the infos array for the result type and call the appropriate method (if available). At the moment, only SELECTs are supported.
  2. Add "SELECT" and "DISTINCT" (if set)
  3. Recursively walk through the pattern array until the element type is "triples". Maintain a counter for the clauses (for the number of necessary table aliases) and an array of variable occurrences. For each triple element (s, p, o): create an SQL snippet if the element is a URI; if it's a variable or a bnode, add the current "location" (e.g. "T4.s" if the subject of the 4th clause is a variable) to the variable occurrences array. Additionally, maintain a list of aliases that are part of an OPTIONAL clause. Parenthesize OPTIONAL clauses and directly add alias mappings via LEFT JOINs.
  4. Create and add alias mappings for every variable in the occurrences array, but only if they are not part of an OPTIONAL clause.
  5. If a background graph is specified, add a mapping to the rdf_source table and joins for every variable. (I can't correctly map the background graph to OPTIONALs yet; they may contain values from other sources.)
  6. Use the variable occurrences array to add the bound result variables to the main SQL code.
  7. Add ORDER BY information.
  8. Add LIMIT and OFFSET information.
Hm, I guess I should program another rewriter now to transform the list above to something readable. Hope you can make some sense of it when you look at the rewriter source..
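To make the steps above a bit more concrete: for a query with one OPTIONAL pattern, the rewriter's output should have roughly the following shape. This is hand-written here, with made-up prefix, properties, and alias details; the actual generated SQL will differ (e.g. it operates on URL-encoded values):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?blog WHERE {
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:weblog ?blog }
}
LIMIT 10
```

```sql
SELECT DISTINCT
  CONCAT(T1.o_merged, T1.o_short_data) AS name, /* ?name */
  CONCAT(T2.o_merged, T2.o_short_data) AS blog  /* ?blog */
FROM arc_triple T1                              /* clause 1: ?person foaf:name ?name */
  LEFT JOIN arc_triple T2                       /* clause 2 is OPTIONAL */
    ON (T2.s_merged = T1.s_merged               /* shared variable ?person */
        AND T2.p = 'http://xmlns.com/foaf/0.1/weblog')
WHERE T1.p = 'http://xmlns.com/foaf/0.1/name'
LIMIT 10
```

The per-clause aliases (T1, T2), the LEFT JOIN for the OPTIONAL clause, and the join on the shared variable correspond to steps 3 and 4 above.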


I've put up a demo which will allow you to run test queries against the store. You can add RDF/XML sources via a form at the site's RDF Browser (lower right corner). I've added two sample queries to retrieve 10 persons with name and optional weblog, or 10 blog item titles from Planet RDF, sorted by date.

Just don't expect the stuff to be perfect. There is a lot of room for improvement. But knowing that several people are currently trying to add SPARQL support to PHP systems, I thought I'd go for an early release. Maybe someone is interested in extending the initial versions, or in helping me fix bugs.

