finally a bnode with a uri

Pragmatic design considerations for a PHP-based SPARQL store

Design of a new SPARQL store for ARC
As mentioned earlier, I had to re-think my initial RDF store design because of exploding query execution times. This was mainly caused by redundant subject-predicate-object triples (for different graphs), but also by inefficient indexes. This post lists some thoughts and considerations concerning a new store which I'm going to make available soon.

The main motivation was the development of an RDF store that can be deployed to shared Web hosting environments. Something that could easily be integrated in existing Web applications like Blogs or CMSs.

PHP is clearly not the most elegant programming language, but it is widespread and continues to attract masses of Web developers. Of course I wouldn't say "Choose PHP if you want to build the perfect RDF store", it just happens to be the scripting language I'm using for most of my projects, so the question is more like "I'm using PHP, how can I still come up with a reasonably sound RDF store?". I went through 2 improvement cycles since April, so here is my list of current considerations, decisions, and lessons learned:

Native/Custom storage vs. RDBMS

Given the prerequisites (shared Web hosting and PHP), I don't think I should even try to build a native storage engine. Furthermore, every environment where running my RDF store may make sense should provide a mySQL database.

Projects like 3store demonstrate that scalable RDF stores can be built using relational databases. Additionally, pushing query execution to the database management system can help me work around PHP's performance limitations and at the same time benefit from existing and mature query optimizers.

Translating SPARQL queries to SQL is not trivial, but it is possible for a reasonable subset of the specification. I consider it at least simpler than having to re-invent things like sorting etc. And I wouldn't have several months time to figure out low-level operations anyway. MySQL comes with (near-)equivalents to SPARQL's DISTINCT, ORDER, LIMIT, OFFSET, OPTIONAL, UNION, and even several FILTER options for free.

Table layout, level of normalization

My initial store used a denormalized schema with varchar and text columns in a single triple table. Implementing this approach is straightforward. It requires more disk space, but certain queries can be executed quite efficiently (see "Jena2 Database Interface - Database Layout" for details). However, this table layout is not optimized for multi-join queries: MySQL can only use its fast static format when a table contains fixed-length columns only. Moreover, indexing varchar columns requires more space (which leads to more and slower disk seeks), and the length of an index key is often limited to 500 (you know, shared hosting etc.), so multi-column indexes can't be properly created at all. (Richard Cyganiak's Report on DB layouts for SPARQL datastores provides additional information.)

So, the strings had to be replaced by identifiers pointing at the values in a separate table. The shorter the IDs, the faster the look-ups, which suggests using integers. I didn't want to use a simple counter, but some hash function for easier replication and merging of RDF data. Unfortunately, PHP doesn't provide a crc64 function but only a 32-bit crc32() to convert strings to integers which may not sufficiently prevent hash collisions. I fell back to 33-character columns containing md5 hashes (1 character saved for potential later use). I removed all variable-length columns from the triple table and store all values in a separate hash2val table instead.

The reason why I had to re-design my implementation was however only partly related to memory-consuming, restricted indexes or inefficient joins on varchar columns. My main problem was the "Cartesian Catastrophe" caused by subject-predicate-object duplicates for different graphs. This effect can partly be reduced by an spo-index (omitting the graph column), but when sorting or a high offset is involved, indexing didn't seem effective enough. I experimented with an external graph table and a table to store triple2graph relations, but I ran into huge problems when I tried to implement reversible resource consolidation (aka Un-smushing) on top of that. Steve Harris shared some experiences the week before ISWC, there seem to be other issues with external graph tables as well.

I pondered quite some time how to address the trade-off between being able to do graph queries vs. efficient triple queries where graph information is irrelevant. I'm trying to solve the problem with two (identical) subject-predicate-object-graph tables now. Not sure if things work out as expected in the long term, but I'm quite happy so far. I can keep the same table layout for each use case (triple queries / frequent quad queries / optimized triple queries with occasional quad queries). The main table is used for inserts and triple queries. It's similar to my initial store's triple table, just with hashes and a fixed format. But this time I've added a method to the store API which can move s-p-o duplicates to the 2nd table, putting the 1st table in a state where the combinatorial explosion is less likely to happen. As the 2nd table only contains the duplicates, the disk space needed stays (almost) the same, and with the two tables being identically structured, I can use a UNION SQL query to run SPARQL graph queries. The tables so far:
CREATE TABLE [prefix]_arc_triple[_bak] (
  s char(33) NOT NULL,
  p char(33) NOT NULL,
  o char(33) NOT NULL,
  g char(33) NOT NULL,
  s_type tinyint(1) NOT NULL default '0',
  s_init char(33) NOT NULL,
  o_type tinyint(1) NOT NULL default '0',
  o_init char(33) NOT NULL,
  o_lang char(8) NULL,
  o_dt char(33) NOT NULL,
  o_comp char(255) NULL,
  misc tinyint(2) NOT NULL default '0',
  [index definition]
)
CREATE TABLE [prefix]_arc_hash2val(
  hash varchar(33) NOT NULL,
  val text NULL,
  PRIMARY KEY (hash)
)
The time to retrieve the values for hashes in the result set seems by the way negligible compared to the time I gain due to the shorter indexes. For the majority of expensive pattern joins (e.g. "WHERE triple1.o=triple2.s"), the hash2val table isn't needed at all.

The s_init and o_init fields in the triple tables are used to enable the smusher's undo feature. The o_comp column is a naive but effective way to implement ordering without having to join the hash2val table. The store converts literals to comparable values by adding leading zeros or adjusting date strings (e.g. a simple string comparison on "3", "12" will return "3">"12", the o_comp column will store the numbers as "+00000000000000000003", and "+00000000000000000012" strings respectively, which makes comparisons return correct results). The o_comp column can also be used for faster keyword searches in short literals such as names or titles.

Indexes

Indexing comes with a trade-off between INSERT and SELECT speed. But with the possibility to lock tables during inserts, I went for query-optimized indexes. Paul Gearon described KOWARI's index structure (in May 2004, but I assume they didn't change them in the meantime), YARS' indexes are similar: 6 indexes to cover any possible query pattern. Phil Dawes followed the KOWARI strategy for Veudas as well. These implementations all store complete 4-part keys, although Andreas' YARS paper contains a list of minimal indexes needed for the different query patterns. Initially in order to save disk space, I created my own set of those shorter indexes and discovered that they can also improve the speed of joins. 3store even uses nothing but single-column indexes (if I read the schema correctly).

The disadvantage of long indexes is that they contain duplicates even when joins only require the first one or two key parts. MySQL can also keep more indexes in memory when they are short which may be helpful in those environments where my store is going to be used.

I ended up with 8 indexes:
  1. spog (duplicate checks/copying)
  2. sog (relations between 2 resources)
  3. pog (resource consolidation on IFPs)
  4. gsp (provenance queries)
  5. sg (joins)
  6. pg (resource consolidation, maybe union queries)
  7. og (joins)
  8. o_comp (sorting, keyword searches)

I'm currently working on a new SPARQL2SQL rewriter which is the missing piece between the ARC SPARQL parser's result and querying the store. One or two more blog posts may be needed before I'm done, but I'm making progress.

Comments are disabled for this post.

Earlier Posts

Later Posts

Archives/Search

YYYY or YYYY/MM
No Posts found

Feeds