finally a bnode with a uri

Posts tagged with: performance

ARC Store design fine-tuning

Getting close to releasing ARC RDF Store
I still haven't released ARC Store, as I keep discovering optimisation possibilities while working on the SPARQL2SQL rewriter. The latter is coming along quite well; I just ticked off simple UNIONs, numeric comparisons, grouped OPTIONALs, and nested (yay!) OPTIONALs on my to-do list. I'm currently struggling a little with GRAPH queries combined with datasets (restricted via FROM / FROM NAMED), but that's more due to my spec-reading incapabilities than to mySQL's interpretation of SQL. I'm going to implement a subset of the SPARQL built-ins (e.g. REGEX), but after that the store should finally be usable enough for a first public release.

However, I'm not that used to relational algebra, and there are lots of mySQL-specific options, so I frequently consulted the manual to find out how to construct e.g. SQL UNIONs and LEFT JOINs, or how to make things easier for the query optimizer. I already wrote about RDF store design considerations last month, but it looks like there's more room for optimisation:

Shorter index keys

I'm still using CHAR columns for the hashes, but instead of using the hex-based md5 of an RDF term, I'm now converting the md5 to a shorter string (without losing information). The CHAR column uses a full byte for each character, but the characters in an md5 string are all from [0-9a-f] (i.e. a rather small 16-character set). Taking the md5 hash as a base-16 number, I can easily shorten it by using a longer character set. As I said before, PHP can't handle large integers, so I split the md5 string into three chunks, converted each part to an integer, and then re-encoded the result with a different, larger set of characters. I first used
'0123456789 abcdefghijklmnopqrstuvwxyz!?()+,-.@;=[]_{}'
(54 characters), which reduced the overall column size to 23 (-28%). Then I found out that BINARY table columns do case-sensitive matching and may even be faster, so I could extend the set to
'0123456789 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}'
(79 chars).

The column size is now 21 (66% of the initial md5). Taking only a sub-portion of the md5 hash (as done by e.g. 3store) could improve things further. This may all sound a little desperate (that's at least what the mySQL folks said), but as ARC Store is probably going to be the only SPARQL engine optimised for basic shared web hosting environments, I assume it's worth the extra effort. Note that overall storage space is not (yet) my main concern; it's the size of the indexes used for join operations.
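As an aside, the re-encoding itself is just a change of base. A minimal sketch in Python (the function name is mine; unlike the PHP original, it doesn't need to split the digest into chunks, since Python supports arbitrary-precision integers):

```python
import hashlib

# The 79-character set from the post; BINARY (case-sensitive) columns
# make mixed-case characters usable.
CHARSET = ('0123456789 abcdefghijklmnopqrstuvwxyz'
           'ABCDEFGHIJKLMNOPQRSTUVWXYZ!?()+,-.@;=[]_{}')

def shorten_md5(term):
    """Re-encode the md5 of an RDF term in base 79 (lossless, <= 21 chars)."""
    # interpret the 32 hex digits as one 128-bit integer
    n = int(hashlib.md5(term.encode('utf-8')).hexdigest(), 16)
    digits = []
    while n:
        n, r = divmod(n, len(CHARSET))
        digits.append(CHARSET[r])
    return ''.join(reversed(digits)) or CHARSET[0]
```

Since ceil(128 / log2(79)) = 21, the result is guaranteed to fit a CHAR/BINARY(21) column.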

OR instead of UNION

SPARQL UNIONs can't always be translated to SQL ORs (at least I couldn't figure out how), so using SQL's UNION construct is the better way to stay compliant. However, for the most common practical use case for UNIONs (alternative predicates), a simple WHERE (p='rdfs:label' OR p='foaf:name' OR ...) is much faster than a union. I don't know how to efficiently automate the detection of when to rewrite to ORs, so I'll probably have to make that option API-only.
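Sketched in SQL (table and column names are mine; the quoted values stand for the shortened hashes of the terms), the two rewrites for an alternative-predicates pattern would look roughly like this:

```sql
-- compliant rewrite: one SELECT per alternative, combined via UNION
SELECT s, o FROM triple WHERE p = '<hash of rdfs:label>'
UNION
SELECT s, o FROM triple WHERE p = '<hash of foaf:name>';

-- faster rewrite for this special case: a single scan with an OR filter
SELECT s, o FROM triple
WHERE (p = '<hash of rdfs:label>' OR p = '<hash of foaf:name>');
```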

Splitting up the table space

I think TAP and Jena offer ways to separate selected statements from the main triple table, thus accelerating certain joins and queries (and saving storage space). I also read about this strategy in a more recent blog post by Chimezie Ogbuji, who describes an approach with a dedicated rdf:type table.

The problem with a generic solution is a) deciding how to split up the triples, and b) efficiently running queries over the whole set of split tables (e.g. for <foo> ?p ?o patterns).

re a): A table for rdf:type is a reasonable idea: about 25% of the statements in the datasets I've worked with so far were rdf:type statements, with another 10% used by dc:date and foaf:name, but the numbers for FOAF and DC terms are clearly application-specific. In order to speed up joins, it might also be useful to completely separate object-to-object relations from those relating resources to literals (e.g. in CONFOTO, the latter consume over 40%).

re b): From the little experience I've gained so far, I don't expect UNIONs or JOIN/OR combinations to be sufficiently fast. But mySQL has a MERGE storage engine, which is "a collection of identical MyISAM tables that can be used as one". This allows efficient queries on selected tables (e.g. for joins, or rdf:type restrictions) and ok-ish performance when the whole set of tables has to be included in a query.
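A rough sketch of how that could look (all table, column, and index names are hypothetical; MERGE requires identically structured MyISAM tables):

```sql
-- one MyISAM table per statement group
CREATE TABLE triple_type    (s CHAR(21), p CHAR(21), o CHAR(21), KEY spo (s, p, o)) ENGINE=MyISAM;
CREATE TABLE triple_literal (s CHAR(21), p CHAR(21), o CHAR(21), KEY spo (s, p, o)) ENGINE=MyISAM;
CREATE TABLE triple_misc    (s CHAR(21), p CHAR(21), o CHAR(21), KEY spo (s, p, o)) ENGINE=MyISAM;

-- a MERGE table presenting the three as one, for queries that
-- have to span the whole statement set
CREATE TABLE triple_all     (s CHAR(21), p CHAR(21), o CHAR(21), KEY spo (s, p, o))
  ENGINE=MERGE UNION=(triple_type, triple_literal, triple_misc) INSERT_METHOD=LAST;
```

Queries that can be restricted to one statement group (e.g. rdf:type lookups) would hit the small dedicated table, while everything else falls back to triple_all.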

I'm still experimenting; it may well be that I only go for the first optimisation in ARC Store v0.1.0, but the other ones are surely worth further consideration.

Quad store performance issues

Explaining "Cartesian Catastrophes" in RDF quad stores.
Here are some rough notes (as requested ;) about the issues I ran into with my previous RDF store design. Whether and when the "combinatorial explosion" problem occurs seems to depend on several factors. It's more likely to happen when
  1. the store is built on top of a relational database system and you try to push SPARQL features to the DB engine
  2. a flat 4-column table layout (subj-pred-obj-graph) is used instead of an external graph table
  3. the graph column is included in the table indexes
  4. only limited memory resources are available and/or the indexed columns require a lot of disk/memory space
  5. your app deals with multiple graphs, esp. when a lot of redundant information from different graphs/sources is added to the knowledge base
  6. queries use ORDER BY, DISTINCT, or a higher OFFSET
  7. the query patterns form reasonably deep paths (chains of joins)

Reproducing the problem

Imagine a collaborative photo annotation system where each contribution is added as a mini-graph to a local RDF store. Each of these snippets will probably contain a typing triple (foaf:Image), some literals (title, description), and possibly some "rich" annotations such as person or event depictions, or related SKOS concepts:
<img1.jpg> a foaf:Image .
<img1.jpg> ... ... .

<img1.jpg> foaf:depicts _:bn1 .
_:bn1 a foaf:Person .
_:bn1 ... ... .

<img1.jpg> foaf:depicts _:bn2 .
_:bn2 a conf:Event .
_:bn2 skos:subject _:bn3 .
_:bn3 skos:prefLabel "Web" .
These little annotation snippets each redundantly create an "img1 is a foaf:Image" triple, and "img1 depicts something" information.

Apart from the SPARQL2SQL rewriting component and the attempt to work around PHP's performance limitations by using mySQL's existing features (DISTINCT, LIMIT, ORDER BY, etc.), my store layout was derived from the various table layouts of documented RDF stores. They all seem to suggest using flat 4-column (s-p-o-g) tables for graph-enabled querying, and these stores have also been successfully tested with large amounts of data. However, my store's query response times exploded even for simple queries. The effect was amplified by some rather naive indexes on varchar columns and my requirement to work with mySQL's default memory settings (the whole thing is targeted at hosted webspace environments), but it would have hit me sooner or later anyway.

So, what happens? When you deal with multiple graphs and don't want to handle small and mid-size queries at the application level, s-p-o-g can lead to a lot of redundant s-p-o triples. When you then run a query with a reasonable number of chained patterns (joins), the RDBMS will find many different ways to satisfy the patterns. This may still be OK until you start sorting or implement pagination. It may sound like a rare corner case, but the query that forced me to rethink my store doesn't look too uncommon, and it required just a small number of joins:
SELECT DISTINCT ?img WHERE {
  ?img a foaf:Image .
  ?img foaf:depicts ?event .
  ?event a ical:Vevent .
  ?event skos:subject ?concept .
  ?concept skos:prefLabel "Web" .
  ?img dc:date ?date .
}
ORDER BY DESC(?date)
This query could be optimized, as the ?img a foaf:Image pattern doesn't really restrict the result set given the domain of foaf:depicts. Still, this little query needed more than 2 minutes to return just 78 distinct image URLs. I gained ~10 seconds after switching to a fixed table layout (replacing varchars with chars), but the basic problem was that mySQL falls back to complete table scans in certain cases. It created a complete set of possible combinations to satisfy "things that are images that depict an event with a subject with the label 'Web'". A mini store (~25-30K statements), 78 distinct results, hundreds of thousands of possible combinations: combinatorial explosion. And there actually weren't that many annotations per image.
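For illustration, a SPARQL2SQL rewriter maps each triple pattern of the query above to a self-join on the quad table, roughly like this (table, alias, and hash placeholder names are mine; a real rewriter would also sort on the literal's value rather than its hash):

```sql
SELECT DISTINCT t0.s AS img
FROM quad t0, quad t1, quad t2, quad t3, quad t4, quad t5
WHERE t0.p = '<rdf:type>'       AND t0.o = '<foaf:Image>'
  AND t1.s = t0.s AND t1.p = '<foaf:depicts>'
  AND t2.s = t1.o AND t2.p = '<rdf:type>'       AND t2.o = '<ical:Vevent>'
  AND t3.s = t1.o AND t3.p = '<skos:subject>'
  AND t4.s = t3.o AND t4.p = '<skos:prefLabel>' AND t4.o = '<"Web">'
  AND t5.s = t0.s AND t5.p = '<dc:date>'
ORDER BY t5.o DESC;
-- every graph that re-asserts one of the matched triples contributes
-- another row at each join step, so the intermediate products grow
-- multiplicatively before DISTINCT and ORDER BY can prune them
```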

Approaches

3store is provided in two different layouts; Steve Harris said there doesn't seem to be a perfect solution, as a separate graph table makes consistency checks during INSERTs more time-consuming. Another possibility is to keep 4 columns but to split the query into sub-queries which are then combined at the app level. I think many frameworks work like this, but it's unfortunately not an option for my implementation, where I have to utilize the RDBMS as much as possible. That (together with usually better-designed indexes and without my memory limitations) may also be why this problem hasn't been reported in more detail yet. The lack of similar use cases, or of multi-graph benchmarks with redundant assertions, may be another reason.

I've now switched to an "s-p-o" plus external "g" table layout, which reduced the response time for the query above to 0.04 seconds. I guess the Cartesian catastrophe can still happen with that design, too; it just needs more patterns. Graph queries are of course more complicated to rewrite to SQL now, and there are new issues that'll need some time to solve.
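The new layout could be sketched like this (hypothetical table and column names; the idea is that each distinct s-p-o triple is stored exactly once, and a separate table records which graphs assert it):

```sql
CREATE TABLE triple (
  t CHAR(21) NOT NULL,   -- triple id, e.g. the hash of s+p+o
  s CHAR(21) NOT NULL,
  p CHAR(21) NOT NULL,
  o CHAR(21) NOT NULL,
  PRIMARY KEY (t), KEY sp (s, p), KEY po (p, o)
);

CREATE TABLE g2t (
  g CHAR(21) NOT NULL,   -- graph id
  t CHAR(21) NOT NULL,   -- triple id
  PRIMARY KEY (g, t)
);
```

Joins then run on the de-duplicated triple table, and the graph table only has to be consulted for GRAPH / FROM NAMED restrictions.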
