finally a bnode with a uri

Quad store performance issues

Explaining "Cartesian Catastrophes" in RDF quad stores.
Here are some rough notes (as requested ;) about the issues I ran into with my previous RDF store design. If and when the "combinatorial explosion" problem occurs seems to depend on several factors. It's more likely to happen when
  1. the store is built on top of a relational database system and you try to push SPARQL features to the DB engine
  2. a flat 4-column table layout (subj-pred-obj-graph) instead of an external graph table is used
  3. the graph column is included in the table indexes
  4. only limited memory resources are available and/or the indexed columns require a lot of disk/memory space
  5. your app deals with multiple graphs, esp. when a lot of redundant information from different graphs/sources is added to the knowledge base
  6. queries use ORDER BY, DISTINCT, or a higher OFFSET
  7. there is a reasonable depth in the paths of the queries

Reproducing the problem

Imagine a collaborative photo annotation system where each contribution is added as a mini-graph to a local RDF store. Each of these snippets will probably contain a typing triple (foaf:Image), some literals (title, description) and possibly some "rich" annotations such as person or event depictions, or related skos concepts:
<img1.jpg> a foaf:Image .
<img1.jpg> ... ... .

<img1.jpg> foaf:depicts _:bn1 .
_:bn1 a foaf:Person .
_:bn1 ... ... .

<img1.jpg> foaf:depicts _:bn2 .
_:bn2 a conf:Event .
_:bn2 skos:subject _:bn3 .
_:bn3 skos:prefLabel "Web" .
These little annotation snippets each redundantly create an "img1 is a foaf:Image" triple, and "img1 depicts something" information.

Apart from the SPARQL2SQL rewriting component and the attempt to work around PHP's performance limitations by using mySQL's existing features (DISTINCT, LIMIT, ORDER BY etc), my store layout was derived from the different table layouts of documented RDF stores. They all seem to suggest to use flat 4-column (s-p-o-g) tables for graph-enabled querying. These stores have also successfully been tested with large amounts of data. However, the query response times of my store exploded even with simple queries. The effect was amplified by some rather naive indexes on varchar columns and my requirement to use mySQL's default memory settings (the whole stuff is targeted at hosted webspace environments), but it would have hit me sooner or later anyway.

So, what happens? When you deal with multiple graphs and don't want to handle small and mid-size queries on the application level, s-p-o-g can lead to a lot of redundant s-p-o triples. When you then run a query with a reasonable number of chained patterns (joins), the RDBMS will find many different ways to satisfy the patterns. This may still be ok unless you start sorting or implement pagination. May sound like a rare corner case, but the query that forced me to rethink my store doesn't look too uncommon, and it required just a small number of joins:
SELECT DISTINCT ?img WHERE {
  ?img a foaf:Image .
  ?img foaf:depicts ?event .
  ?event a ical:Vevent .
  ?event skos:subject ?concept .
  ?concept skos:prefLabel "Web" .
  ?img dc:date ?date .
}
ORDER BY DESC(?date)
This query could be optimized as the ?img a foaf:Image doesn't really restrict the result set due to the domain of foaf:depicts. However, this little query needed more than 2 minutes to return just 78 distinct image URLs. I gained ~10 seconds after switching to a fixed table layout (replacing varchars with chars), but the basic problem was that mySQL falls back to complete table scans in certain cases. It created a complete set of possible combinations to satisfy "things that are images that depict an event with a subject with a label 'Web'". A mini store (~25-30K statements), 78 distinct results, 100Ks of possible combinations: combinatorial explosion. And there actually weren't that many annotations per image.

Approaches

3store is provided in two different layouts, Steve Harris said there doesn't seem to be a perfect solution, a separate graph table makes consistency checks during INSERTs more time-consuming. Another possibility is to keep 4 columns but to split the query in sub-queries which are then combined on the app-level. I think many frameworks are working like this but it's unfortunately not an option for my implementation where I have to utilize the RDBMS as much as possible. That (together with usually better designed indexes and without the memory limitations I have) may also be the reason why this problem hasn't been reported in more detail yet. The lack of similar use cases or multi-graph benchmarks with redundant assertions may be another reason.

I've switched now to an "s-p-o" plus external "g" table layout which reduced the response time for the query above to 0.04 seconds, but I guess the Cartesian catastrophe can happen with that design, too, it just needs more patterns. Graph queries are of course more complicated to rewrite to SQL now, and there are new issues that'll need some time to solve.

Comments are disabled for this post.

Earlier Posts

Later Posts

Archives/Search

YYYY or YYYY/MM
No Posts found

Feeds