ARC Store design fine-tuning

Getting close to releasing ARC RDF Store
I still haven't released ARC Store, as I keep discovering optimisation possibilities while working on the SPARQL2SQL rewriter. The latter is coming along quite well; I just ticked off simple UNIONs, numeric comparisons, grouped OPTIONALs, and nested (yay!) OPTIONALs on my to-do list. I'm currently struggling a little with GRAPH queries combined with datasets (restricted via FROM / FROM NAMED), but that's more due to my difficulties reading the spec than to MySQL's interpretation of SQL. I'm going to implement a subset of the SPARQL built-ins (e.g. REGEX), but after that the store should finally be usable enough for a first public release.

However, I'm not that used to relational algebra, and there are lots of MySQL-specific options, so I frequently consulted the manual to find out how to, e.g., construct SQL UNIONs and LEFT JOINs, or how to make things easier for the query optimizer. I already wrote about RDF store design considerations last month, but it looks like there's more room for optimisation:

Shorter index keys

I'm still using CHAR columns for the hashes, but instead of storing the hex-based md5 of an RDF term, I'm now converting the md5 to a shorter string (without losing information). A CHAR column uses a full byte for each character, but the characters in an md5 string are all from [0-9a-f] (i.e. a rather small 16-character set). Taking the md5 hash as a base-16 number, I can easily shorten it by using a longer character set. As I said before, PHP can't handle large integers, so I split the md5 string into three chunks, converted each part to an integer, and then re-encoded the result with a different, larger set of characters. I first used
'0123456789 abcdefghijklmnopqrstuvwxyz!?()+,-.@;=[]_{}'
(54 characters), which reduced the overall column size to 23 (-28%). Then I found out that BINARY table columns do case-sensitive matching and may even be faster, so I could extend the set to
'0123456789 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}'
(79 chars).

The column size is now 21 (66% of the initial md5). Taking only a sub-portion of the md5 hash (as done by 3store, for example) could improve things further. This may all sound a little desperate (that's at least what MySQL folks said), but as ARC Store is probably going to be the only SPARQL engine optimised for basic shared web hosting environments, I assume it's worth the extra effort. Note that overall storage space is not (yet) my main concern; it's the size of the indexes used for join operations.
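The re-encoding can be sketched in a few lines of Python (ARC itself is PHP, so treat the function names and the exact 11/11/10 chunk widths below as illustrative assumptions, not the actual implementation):

```python
import hashlib

# the 79-character set from above; safe for case-sensitive BINARY columns
ALPHABET = ('0123456789 abcdefghijklmnopqrstuvwxyz'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}')

def encode_chunk(hex_chunk, width):
    """Treat a hex chunk as a base-16 number and re-encode it in base 79."""
    n = int(hex_chunk, 16)
    out = ''
    for _ in range(width):
        out = ALPHABET[n % 79] + out
        n //= 79
    return out

def short_hash(term):
    """Shrink the 32-char md5 hex digest of an RDF term to 21 chars."""
    h = hashlib.md5(term.encode('utf-8')).hexdigest()
    # split into 11 + 11 + 10 hex digits: each chunk fits a native
    # integer, and each fits into 7 base-79 characters (79^7 > 16^11)
    return (encode_chunk(h[:11], 7) +
            encode_chunk(h[11:22], 7) +
            encode_chunk(h[22:], 7))

print(len(short_hash('http://example.org/foo')))  # 21
```

Since each chunk is re-encoded at a fixed width that can represent every possible chunk value, the mapping stays injective, i.e. no information is lost.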

OR instead of UNION

SPARQL UNIONs can't always be translated to SQL ORs (at least I couldn't figure out how), so using SQL's UNION construct is the safer way to stay compliant. However, for the most common practical use case for UNIONs (alternative predicates), a simple WHERE (p='rdfs:label' OR p='foaf:name' OR ...) is much faster than a union. I don't know how to efficiently automate the detection of when to rewrite to ORs, though; I'll probably have to make that optimisation API-only.
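The two translations can be sketched like this, in Python for illustration (the table and column names are made up; ARC's actual rewriter is PHP):

```python
def union_sql(preds):
    """The generic, compliant translation: one SELECT per UNION branch."""
    selects = ["SELECT s, o FROM triple WHERE p = '%s'" % p for p in preds]
    return ' UNION '.join(selects)

def or_sql(preds):
    """The faster rewrite for alternative predicates: a single table scan."""
    cond = ' OR '.join("p = '%s'" % p for p in preds)
    return 'SELECT s, o FROM triple WHERE (%s)' % cond

print(or_sql(['rdfs:label', 'foaf:name']))
# SELECT s, o FROM triple WHERE (p = 'rdfs:label' OR p = 'foaf:name')
```

The OR rewrite is only equivalent when the UNION branches differ in nothing but the predicate position of a single triple pattern; anything more complex still needs a real UNION.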

Splitting up the table space

I think TAP and Jena offer ways to separate selected statements from the main triple table, thus accelerating certain joins and queries (and saving storage space). I also read about this strategy in a more recent blog post by Chimezie Ogbuji who describes an approach with a dedicated rdf:type table.

The problem with a generic solution is a) deciding how to split up the triples, and b) efficiently running queries over the whole set of split tables (e.g. for <foo> ?p ?o patterns).

re a): A table for rdf:type is a reasonable idea: 25% of the statements in the datasets I've worked with so far were rdf:type statements, with another 10% taken up by dc:date and foaf:name, but the proportions of FOAF and DC terms are clearly application-specific. In order to speed up joins, it might also be useful to completely separate resource-to-resource relations from those relating resources to literals (in CONFOTO, for example, the latter consume over 40%).

re b): From the little experience I've gained so far, I don't expect UNIONs or JOIN/OR combinations to be sufficiently fast. But MySQL has a MERGE storage engine, which is "a collection of identical MyISAM tables that can be used as one". This allows efficient queries on selected tables (e.g. for joins, or rdf:type restrictions) and ok-ish performance when the whole set of tables has to be included in a query.
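In MySQL DDL, the idea could look roughly like this (the table names, column widths, and the two-way split are hypothetical; a MERGE table requires all underlying tables to have identical structure):

```sql
-- identical MyISAM tables for different statement groups
CREATE TABLE triple_type (s BINARY(21), p BINARY(21), o BINARY(21),
                          KEY sp (s, p)) ENGINE=MyISAM;
CREATE TABLE triple_misc (s BINARY(21), p BINARY(21), o BINARY(21),
                          KEY sp (s, p)) ENGINE=MyISAM;

-- a MERGE table that behaves like the union of both, for queries
-- that can't be restricted to a single statement group
CREATE TABLE triple_all (s BINARY(21), p BINARY(21), o BINARY(21),
                         KEY sp (s, p))
  ENGINE=MERGE UNION=(triple_type, triple_misc) INSERT_METHOD=NO;
```

Restricted queries (e.g. rdf:type lookups) go straight to the small table; only <foo> ?p ?o-style patterns have to touch triple_all.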

I'm still experimenting; it may well be that I only go for the first optimisation in ARC Store v0.1.0, but the others are surely worth further consideration.
