
Posts tagged with: store

ARC RDF Store for PHP - enSPARQL your LAMP

ARC RDF Store release
A first version of ARC RDF Store is now available. It's written entirely in PHP and has been optimized for basic LAMP environments where install and system configuration privileges are often not available. As with the other ARC components, I tried to keep things modular for easier integration in other PHP/MySQL-based systems.

A full store installation consists of just 7 files (

ARC Store design fine-tuning

Getting close to releasing ARC RDF Store
I still haven't released ARC Store as I'm continually discovering optimisation possibilities while working on the SPARQL2SQL rewriter. The latter is coming along quite well; I just ticked off simple UNIONs, numeric comparisons, grouped OPTIONALs, and nested (yay!) OPTIONALs on my to-do list. I'm currently struggling a little bit with GRAPH queries combined with datasets (restricted via FROM / FROM NAMED), but that's more due to my spec-reading incapabilities than to mySQL's interpretation of SQL. I'm going to implement a subset of the SPARQL built-ins (e.g. REGEX), but after that the store should finally be usable enough for a first public release.
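
For reference, the query shape meant here combines dataset clauses with a GRAPH pattern. A made-up example (the URIs and the PHP wrapper are mine, not taken from ARC):
<?php
// Hypothetical example of the pattern discussed above: a dataset restricted via
// FROM NAMED, combined with a GRAPH pattern that binds the graph to a variable.
$query = <<<SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?g ?name
FROM NAMED <http://example.org/graph1>
FROM NAMED <http://example.org/graph2>
WHERE {
  GRAPH ?g { ?person foaf:name ?name . }
}
SPARQL;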

However, I'm not that used to relational algebra and there are lots of mySQL-specific options, so I frequently consulted the manual to find out how to, e.g., construct SQL UNIONs and LEFT JOINs, or how to make things easier for the query optimizer. I already wrote about RDF store design considerations last month, but it looks like there's more room for optimisation:

Shorter index keys

I'm still using CHAR columns for the hashes, but instead of storing the hex-based md5 of an RDF term, I now convert the md5 to a shorter string (without losing information). A CHAR column uses a full byte for each character, but the characters in an md5 string are all from [0-9a-f] (i.e. a rather small 16-character set). Taking the md5 hash as a base-16 number, I can easily shorten it by using a longer character set. As I said before, PHP can't handle large integers, so I split the md5 string into three chunks, converted each part to an integer, and then re-encoded the result with a different, larger set of characters. I first used
'0123456789 abcdefghijklmnopqrstuvwxyz!?()+,-.@;=[]_{}'
(54 characters) which reduced the overall column size to 23 (-28%). Then I found out that BINARY table columns do case-sensitive matching and may even be faster, so I could change the set to
'0123456789 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}'
(79 chars).

The column size is now 21 (66% of the initial md5). Taking only a sub-portion of the md5 hash (as e.g. done by 3store) could improve things further. This may all sound a little bit desperate (that's at least what mySQL folks said), but as ARC Store is probably going to be the only SPARQL engine optimised for basic shared web hosting environments, I assume it's worth the extra effort. Note that overall storage space is not (yet) my main concern; it's the size of the indexes used for join operations.
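
A rough reconstruction of the re-encoding trick, under a few assumptions (the 10/11/11 chunk split and the fixed 7-character padding are my guesses that happen to reproduce the 21-character total; ARC's actual code may differ, and a 64-bit PHP build is assumed so each chunk fits into an integer):
<?php
// Sketch: shorten a hex md5 by re-encoding it with the 79-character set above.
function shorten_md5($term) {
  $chars = '0123456789 abcdefghijklmnopqrstuvwxyz'
         . 'ABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}';
  $base = strlen($chars); // 79
  $md5 = md5($term);      // 32 hex characters
  $result = '';
  // split into chunks small enough for PHP integers, then re-encode each chunk
  foreach (array(substr($md5, 0, 10), substr($md5, 10, 11), substr($md5, 21, 11)) as $chunk) {
    $int = hexdec($chunk);
    $encoded = '';
    while ($int > 0) {
      $encoded = $chars[$int % $base] . $encoded;
      $int = intdiv($int, $base); // PHP 7+; (int)($int / $base) on older versions
    }
    // fixed width per chunk ('0' is the zero digit of the character set)
    $result .= str_pad($encoded, 7, '0', STR_PAD_LEFT);
  }
  return $result; // 21 characters instead of the 32-character hex md5
}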

OR instead of UNION

SPARQL UNIONs can't always be translated to SQL ORs (at least I couldn't figure out how), so using SQL's UNION construct is the better way to be compliant. However, for most practical UNION use cases (alternative predicates), a simple WHERE (p='rdfs:label' OR p='foaf:name' OR ...) is much faster than a union. I don't know how to efficiently automate the detection of when to rewrite to ORs, so I'll probably have to make that an API-only option.
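
A sketch of what such a rewrite could look like on the API side (the helper and the placeholder hashes are made up, not part of ARC; since the hash alphabet above contains no quote or backslash characters, simple quoting is sufficient here):
<?php
// Build a single IN() filter for the "alternative predicates" case,
// equivalent to (p='...' OR p='...' OR ...) but without an SQL UNION.
function predicate_filter($column, $predicate_hashes) {
  $quoted = array();
  foreach ($predicate_hashes as $hash) {
    $quoted[] = "'" . $hash . "'";
  }
  return $column . ' IN (' . implode(', ', $quoted) . ')';
}

// e.g. "t1.p IN ('<hash of rdfs:label>', '<hash of foaf:name>')"
echo predicate_filter('t1.p', array('<hash of rdfs:label>', '<hash of foaf:name>'));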

Splitting up the table space

I think TAP and Jena offer ways to separate selected statements from the main triple table, thus accelerating certain joins and queries (and saving storage space). I also read about this strategy in a more recent blog post by Chimezie Ogbuji, who describes an approach with a dedicated rdf:type table.

The problem with a generic solution is a) deciding how to split up the triples, and b) efficiently running queries over the whole set of split tables (e.g. for <foo> ?p ?o patterns).

re a): A table for rdf:type is a reasonable idea; 25% of the statements in the datasets I've worked with so far were rdf:type statements, with another 10% taken up by dc:date and foaf:name, but the proportions of FOAF and DC terms are clearly application-specific. In order to speed up joins, it might also be useful to completely separate object2object relations from those relating resources to literals (e.g. in CONFOTO, the latter account for over 40%).

re b): From the little experience I've gained so far, I don't expect UNIONs or JOIN/OR combinations to be sufficiently fast. But mySQL has a MERGE storage engine, which is "a collection of identical MyISAM tables that can be used as one". This allows efficient queries on selected tables (e.g. for joins or rdf:type restrictions) and ok-ish performance when the whole set of tables has to be included in a query.
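
To illustrate the MERGE idea, a minimal sketch with made-up table names and a reduced column set (not ARC's schema):
<?php
// Two identical MyISAM tables plus a MERGE table that presents them as one.
// Joins or rdf:type restrictions can hit the individual tables directly;
// <foo> ?p ?o style queries go against the MERGE table.
$ddl = <<<SQL
CREATE TABLE triple_type (
  s CHAR(33) NOT NULL, p CHAR(33) NOT NULL, o CHAR(33) NOT NULL, g CHAR(33) NOT NULL
) ENGINE=MyISAM;

CREATE TABLE triple_rest (
  s CHAR(33) NOT NULL, p CHAR(33) NOT NULL, o CHAR(33) NOT NULL, g CHAR(33) NOT NULL
) ENGINE=MyISAM;

CREATE TABLE triple_all (
  s CHAR(33) NOT NULL, p CHAR(33) NOT NULL, o CHAR(33) NOT NULL, g CHAR(33) NOT NULL
) ENGINE=MERGE UNION=(triple_type, triple_rest) INSERT_METHOD=NO;
SQL;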

I'm still experimenting; it may well be that I only go for the first optimisation in ARC Store v0.1.0, but the other ones are surely worth further consideration.

Pragmatic design considerations for a PHP-based SPARQL store

Design of a new SPARQL store for ARC
As mentioned earlier, I had to re-think my initial RDF store design because of exploding query execution times. This was mainly caused by redundant subject-predicate-object triples (for different graphs), but also by inefficient indexes. This post lists some thoughts and considerations concerning a new store which I'm going to make available soon.

The main motivation was the development of an RDF store that can be deployed to shared Web hosting environments, something that can easily be integrated into existing Web applications like blogs or CMSs.

PHP is clearly not the most elegant programming language, but it is widespread and continues to attract masses of Web developers. Of course I wouldn't say "Choose PHP if you want to build the perfect RDF store"; it just happens to be the scripting language I'm using for most of my projects, so the question is more like "I'm using PHP, how can I still come up with a reasonably sound RDF store?". I've gone through two improvement cycles since April, so here is my list of current considerations, decisions, and lessons learned:

Native/Custom storage vs. RDBMS

Given the prerequisites (shared Web hosting and PHP), I don't think I should even try to build a native storage engine. Furthermore, every environment where running my RDF store may make sense should provide a mySQL database.

Projects like 3store demonstrate that scalable RDF stores can be built using relational databases. Additionally, pushing query execution to the database management system can help me work around PHP's performance limitations and at the same time benefit from existing and mature query optimizers.

Translating SPARQL queries to SQL is not trivial, but it is possible for a reasonable subset of the specification. I consider it at least simpler than having to re-invent things like sorting etc., and I wouldn't have several months' time to figure out low-level operations anyway. MySQL comes with (near-)equivalents to SPARQL's DISTINCT, ORDER BY, LIMIT, OFFSET, OPTIONAL, UNION, and even several FILTER options for free.
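
As a rough illustration of that mapping (my own example, not output of the ARC rewriter; table and column names follow the layout described further down, and the hashes are placeholders):
<?php
$sparql = <<<SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person ?name ?mbox WHERE {
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:mbox ?mbox . }
}
ORDER BY ?name LIMIT 10 OFFSET 20
SPARQL;

$sql = <<<SQL
SELECT DISTINCT t1.s AS person, t1.o AS name, t2.o AS mbox
FROM arc_triple t1
  LEFT JOIN arc_triple t2                -- OPTIONAL becomes a LEFT JOIN
    ON t2.s = t1.s AND t2.p = '<hash of foaf:mbox>'
WHERE t1.p = '<hash of foaf:name>'
ORDER BY t1.o_comp                       -- ordering via the comparable o_comp column
LIMIT 20, 10                             -- OFFSET 20, LIMIT 10
SQL;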

Table layout, level of normalization

My initial store used a denormalized schema with varchar and text columns in a single triple table. Implementing this approach is straightforward. It requires more disk space, but certain queries can be executed quite efficiently (see "Jena2 Database Interface - Database Layout" for details). However, this table layout is not optimized for multi-join queries: MySQL can only use its fast static row format when a table contains nothing but fixed-length columns. Moreover, indexing varchar columns requires more space (which leads to more and slower disk seeks), and the length of an index key is often limited to 500 bytes (you know, shared hosting etc.), so multi-column indexes can't be properly created at all. (Richard Cyganiak's Report on DB layouts for SPARQL datastores provides additional information.)

So, the strings had to be replaced by identifiers pointing at the values in a separate table. The shorter the IDs, the faster the look-ups, which suggests using integers. I didn't want to use a simple counter, though, but some hash function, for easier replication and merging of RDF data. Unfortunately, PHP doesn't provide a crc64 function, only a 32-bit crc32() to convert strings to integers, which may not sufficiently prevent hash collisions. I fell back to 33-character columns containing md5 hashes (one character reserved for potential later use). I removed all variable-length columns from the triple table and store all values in a separate hash2val table instead.
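
A minimal sketch of how terms are hashed and stored under this layout (table names simplified and columns reduced; the real triple table carries additional columns, and escaping/execution against the mySQL connection is left out):
<?php
// Terms are hashed, the full values go into hash2val, and the triple table
// itself only stores fixed-length hash columns.
function term_hash($value) {
  return md5($value); // 32 hex chars, kept in the 33-character hash columns
}

function triple_insert_sql($s, $p, $o, $g) {
  $sql = array();
  foreach (array_unique(array($s, $p, $o, $g)) as $term) {
    // keep the full value in hash2val so it can be restored for result sets
    $sql[] = "INSERT IGNORE INTO arc_hash2val (hash, val) VALUES ('"
           . term_hash($term) . "', '" . addslashes($term) . "')";
  }
  $sql[] = sprintf(
    "INSERT INTO arc_triple (s, p, o, g) VALUES ('%s', '%s', '%s', '%s')",
    term_hash($s), term_hash($p), term_hash($o), term_hash($g)
  );
  return $sql; // statements to be sent to the store's mySQL connection
}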

The reason why I had to re-design my implementation was, however, only partly related to memory-consuming, size-restricted indexes or inefficient joins on varchar columns. My main problem was the "Cartesian Catastrophe" caused by subject-predicate-object duplicates for different graphs. This effect can partly be reduced by an spo index (omitting the graph column), but when sorting or a high offset was involved, indexing didn't seem effective enough. I experimented with an external graph table and a table to store triple2graph relations, but I ran into huge problems when I tried to implement reversible resource consolidation (aka un-smushing) on top of that. Steve Harris shared some experiences the week before ISWC; there seem to be other issues with external graph tables as well.

I pondered for quite some time how to address the trade-off between being able to run graph queries and efficient triple queries where graph information is irrelevant. I'm now trying to solve the problem with two (identical) subject-predicate-object-graph tables. Not sure if things will work out as expected in the long term, but I'm quite happy so far. I can keep the same table layout for each use case (triple queries / frequent quad queries / optimized triple queries with occasional quad queries).

The main table is used for inserts and triple queries. It's similar to my initial store's triple table, just with hashes and a fixed row format. But this time I've added a method to the store API which can move s-p-o duplicates to the 2nd table, putting the 1st table in a state where the combinatorial explosion is less likely to happen. As the 2nd table only contains the duplicates, the disk space needed stays (almost) the same, and with the two tables being identically structured, I can use a UNION SQL query to run SPARQL graph queries. The tables so far:
CREATE TABLE [prefix]_arc_triple[_bak] (
  s char(33) NOT NULL,
  p char(33) NOT NULL,
  o char(33) NOT NULL,
  g char(33) NOT NULL,
  s_type tinyint(1) NOT NULL default '0',
  s_init char(33) NOT NULL,
  o_type tinyint(1) NOT NULL default '0',
  o_init char(33) NOT NULL,
  o_lang char(8) NULL,
  o_dt char(33) NOT NULL,
  o_comp char(255) NULL,
  misc tinyint(2) NOT NULL default '0',
  [index definition]
)
CREATE TABLE [prefix]_arc_hash2val(
  hash varchar(33) NOT NULL,
  val text NULL,
  PRIMARY KEY (hash)
)
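
For graph (quad) queries, the duplicates moved to the second table have to be taken into account again, which is where the UNION mentioned above comes in. A sketch (the pattern and hash are made up):
<?php
// Query both (identically structured) tables when the graph is part of the pattern.
$sql = <<<SQL
SELECT s, o, g FROM [prefix]_arc_triple
  WHERE p = '<hash of foaf:name>'
UNION ALL
SELECT s, o, g FROM [prefix]_arc_triple_bak
  WHERE p = '<hash of foaf:name>'
SQL;
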
By the way, the time needed to retrieve the values for the hashes in a result set seems negligible compared to the time gained through the shorter indexes. For the majority of expensive pattern joins (e.g. "WHERE triple1.o=triple2.s"), the hash2val table isn't needed at all.

The s_init and o_init fields in the triple tables are used to enable the smusher's undo feature. The o_comp column is a naive but effective way to implement ordering without having to join the hash2val table. The store converts literals to comparable values by adding leading zeros or adjusting date strings (e.g. a simple string comparison of "3" and "12" returns "3" > "12"; the o_comp column stores the numbers as "+00000000000000000003" and "+00000000000000000012" respectively, which makes comparisons return correct results). The o_comp column can also be used for faster keyword searches in short literals such as names or titles.
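
A simplified take on that normalisation (my own sketch, not ARC's exact code; negative numbers and the date adjustments would need extra treatment and are omitted):
<?php
// Pad non-negative integers to a fixed width so that plain byte-wise string
// comparison (as done on the CHAR/BINARY column) matches numeric order.
function to_comparable($literal) {
  if (ctype_digit((string) $literal)) {
    return '+' . str_pad($literal, 20, '0', STR_PAD_LEFT); // "12" -> "+00000000000000000012"
  }
  // other short literals (names, titles, dates) are truncated to the char(255) column
  return substr($literal, 0, 255);
}

var_dump(strcmp('3', '12') > 0);                               // true: "3" sorts after "12"
var_dump(strcmp(to_comparable('3'), to_comparable('12')) < 0); // true: padded values sort correctly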

Indexes

Indexing comes with a trade-off between INSERT and SELECT speed, but with the possibility to lock tables during inserts, I went for query-optimized indexes. Paul Gearon described KOWARI's index structure (in May 2004, but I assume they haven't changed it in the meantime); YARS' indexes are similar: 6 indexes to cover any possible query pattern. Phil Dawes followed the KOWARI strategy for Veudas as well. These implementations all store complete 4-part keys, although Andreas' YARS paper contains a list of the minimal indexes needed for the different query patterns. Initially in order to save disk space, I created my own set of such shorter indexes and discovered that they can also improve the speed of joins. 3store even uses nothing but single-column indexes (if I read the schema correctly).

The disadvantage of long indexes is that they contain duplicated key data even when joins only require the first one or two key parts. MySQL can also keep more indexes in memory when they are short, which may be helpful in the environments where my store is going to be used.

I ended up with 8 indexes:
  1. spog (duplicate checks/copying)
  2. sog (relations between 2 resources)
  3. pog (resource consolidation on IFPs)
  4. gsp (provenance queries)
  5. sg (joins)
  6. pg (resource consolidation, maybe union queries)
  7. og (joins)
  8. o_comp (sorting, keyword searches)
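
One possible way to spell out those indexes in SQL (my guess at the column orders implied by the names; not necessarily the [index definition] actually used in the table above):
<?php
$sql = <<<SQL
ALTER TABLE [prefix]_arc_triple
  ADD INDEX spog (s, p, o, g),
  ADD INDEX sog (s, o, g),
  ADD INDEX pog (p, o, g),
  ADD INDEX gsp (g, s, p),
  ADD INDEX sg (s, g),
  ADD INDEX pg (p, g),
  ADD INDEX og (o, g),
  ADD INDEX o_comp (o_comp)
SQL;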

I'm currently working on a new SPARQL2SQL rewriter, which is the missing piece between the ARC SPARQL parser's output and querying the store. One or two more blog posts may be needed before I'm done, but I'm making progress.
