Code.semsol.org - A central home for semsol code

Semsol gets code repositories and browsers
The code bundles on the ARC website are generated in an inefficient manual process, and each patch has to wait for the next to-be-generated zip file. The developer community is growing (there are now 600 ARC downloads each month), I'm increasingly receiving patches and requests for a proper repository, and the Trice framework is about to go online as well. So I spent last week building a dedicated source code site for all semsol projects at code.semsol.org.

So far, it's not much more than a directory browser with source preview and a little method navigator. But it will simplify code sharing and frequent updates for me, and hopefully also for ARC and Trice developers. You can check out various Bazaar code branches and generate a bundle from any directory. The app can't display repository messages yet (the server doesn't have bzr installed, I'm just deploying branches using the handy FTP option), but I'll try to come up with a work-around or an alternative when time permits.

Code Browser

CommonTag too complicated?

Not sure if the commontag effort sends the right message.
Update: I just read the spec again, and I can't tag non-content with the CommonTag vocabulary. Too bad. Please ignore the last paragraph.

Sorry for raising my voice here, but some of us are really working hard to show that SemWeb technologies don't have to be complicated, and unfortunately, the new CommonTag effort seems to send exactly the opposite message.

Don't get me wrong, a widely used tagging ontology would be great. We do have 3 (or 4? 5?) tagging vocabularies already, but none really caught on, possibly because tagging is meant to be simple and the proposed solutions apparently weren't easy enough. CommonTag is promoted as being "simple" and "easy", but after looking at the examples in the QuickStart Guide, I'm not so sure:
  • The snippets are really off-putting (not only for Non-RDFers). Do I really need multiple nested HTML nodes to create something as simple as a tag?
  • Couldn't the term names be more intuitive? What could a ctag:Tag be? The actual tag or an intermediate resource that is then, err, tagged? A person ctag:tagged a resource, right? Ah, no.
  • Why aren't the term names at least consistent? "ctag:taggingDate" follows noun-role, "ctag:tagged" is a dunno, "ctag:means" is a present-form verb, "ctag:isAbout" sort-of follows the hasPropertyOf anti-pattern.
  • The vocabulary introduces aliases for well-deployed terms such as rdfs:label and dct:created, which makes its use in practical settings expensive (it'll ease things on the author side, though).

To be a little more constructive: Using the vocabulary doesn't have to lead to the complicated markup seen in the examples. I'm sure they'll soon get better snippets from someone in the RDFa community. And apart from that, there is also a handy term in the RDF Schema which might just be what you are looking for: "ctag:isAbout". It lets you directly point from a resource (default is the page) to a Linked Data identifier (e.g. from DBPedia), without the need for all those intermediate nodes (which lead to triple bloat and slow down SPARQL queries). CommonTag-consuming apps will have to implement some form of inferencing to handle "isAbout", but as the term is in the spec, I assume they plan to.
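To make the triple-bloat point concrete, here is a minimal PHP sketch that writes both modeling styles as plain subject/predicate/object arrays and counts them. The page URL, label, and date are made up for illustration, and the shape of the full tag structure is only an approximation of the QuickStart examples, not a normative rendering of the spec.

```php
<?php
// Hypothetical page URL; the DBpedia identifier is just an example concept.
$page = 'http://example.org/post/1';
$concept = 'http://dbpedia.org/resource/Semantic_Web';

// Full tag structure: an intermediate (blank) tag node sits between
// the page and the concept, and carries label and date.
$fullTagging = [
    [$page, 'ctag:tagged', '_:tag1'],
    ['_:tag1', 'rdf:type', 'ctag:Tag'],
    ['_:tag1', 'ctag:means', $concept],
    ['_:tag1', 'ctag:label', '"Semantic Web"'],
    ['_:tag1', 'ctag:taggingDate', '"2009-01-01"'], // arbitrary date
];

// Shortcut: one direct triple from the page to the concept.
$shortcut = [
    [$page, 'ctag:isAbout', $concept],
];

printf(
    "full structure: %d triples, shortcut: %d triple\n",
    count($fullTagging),
    count($shortcut)
);
```

The shortcut loses the label and the tagging date, which is the trade-off: less to write and query, but a consumer has to look up or infer the rest.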

Granular modeling of tags is apparently tricky, but shouldn't there be some sweet spot? Something a little more expressive than rel-tag but less complex than a fully spec'd Tag ontology? xFolk looks promising, or maybe the CommonTag group members could have agreed on formalizing and supporting "scoped rel-tag" (rel-tags with an optional RDFa "about" container). Most rel-tag-to-RDF converters have some form of scoping already anyway (because tags can apply to reviews, pages, vcards, etc.). That would have been a cool outcome after 1 year of stealth work.

I may well just be over-stressing the simplicity aspect here. Maybe CommonTag is "simple enough" for web publishers. There are some initial supporters, and for RDFers, the nested structures and bnodes will most probably be acceptable. So let's see how things evolve.

I personally think I'll have a closer look at ctag:isAbout. I'm still looking for an alternative to dc/dct:subject to tag arbitrary things with arbitrary identifiers, maybe CommonTag can provide it, although
<#me> ctag:isAbout dbpedia:Semantic_Web .
still doesn't sound right for a rich tag, and the domain is "ctag:TaggedContent" which sounds wrong for non-textual resources, too. (dct:relation is the best I could find so far for tagging things with things, but Dublin Core is coming from a publishing context and is therefore often recommended for describing publications only).

ESWC 2009 Linked Data Dashboards

A first Paggr application went live during ESWC2009.
In case you missed the tweets or a local announcement: The first Paggr application went online a few days ago. This year's ESWC Technologies Team pushed things a little further, with RFID tracking during the event and extended conference data that includes detailed session and date/time information (kudos to Michael Hausenblas for RDFizing even PDFs).

Based on this dataset, we provided a conference explorer and stress-tested the "Dog Food" server while at it. The system survived, but I also learned a lot. We used about 50 RDF stores for the different public and user-specific dashboards, which basically worked nicely. However, rendering non-ugly resource summaries requires a bit of endpoint hammering, and some of the more complex path queries resulted in timeouts. Yesterday, I had to create a mirror from the data dump to route a couple of widgets through a replicated (ARC :-) endpoint. But then this is also one of the powerful possibilities that come with semantic web technologies. You can often switch or double the back-end repository in no time, and without any code changes. (And as all the Sparqlets are created in a web-based tool, I didn't even have to upload a changed configuration file. I simply tweaked a SPARQLScript parameter.)

Anyway, there are a couple of public dashboards, in case you'd like to give it a try (it's still an early version). I also embedded a short screencast below. The system is going to be moved to a DERI server when the conference is over, but the URIs and data will probably stay stable. (And no, it won't really work with IE yet.) More to come!



HQ version (quicktime, 110MB)

Simple RDFication of SPARQL SELECT results with RDFa

How to use RDFa to make SELECT results locally available as RDF
A couple of weeks ago, I wrote about the self-enforcing value spiral that RDF data enables. Here is an example of how RDFa can be used to support this "Repurpose-Republish" loop.

While data exchange between different semantic web sources is usually RDF-based (i.e. the data always maintain their semantics), there is one major exception: SPARQL SELECT queries. This developer-oriented operation returns tabular data (similar to record sets in SQL). Once the query result is separated from the query, the associated structural data is lost. You can't directly feed SELECT results back into a triple store, even though querying based on linked resources means that you have just created knowledge. It's a pity to show this generated information to human consumers only.

One of the demos at my NYC talk was a dynamic wiki item that pulled in competitor information from Semantic CrunchBase and injected that into a page template as HTML. The existing RDF infrastructure does not let me cache the SELECT results locally as usable RDF. And a semantic web client or crawler that indexes the wiki page will not learn how the described resource (e.g. Twitter) is related to the remote, linked entities.

wiki with linked data

However, by simply adding a single RDFa hook to the wiki item template, the RDF relation (e.g. competitor) can be made available again to apps that process my site content. This is basically how Linked Data works. But here is the really nifty thing: My site can be a consumer of its own pages, too, recursively enriching its own data.
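A minimal sketch of what such a template hook could look like in PHP, assuming the SELECT result arrives as rows with `uri` and `name` keys. The function names, the `ex:` prefix, and all example.org URIs are hypothetical; the real wiki template and the CrunchBase vocabulary are not shown in this post.

```php
<?php
/** Escape a value for use in HTML text or attributes. */
function h(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

/**
 * Render SELECT rows as an HTML list that keeps the relation as RDFa.
 * Each row is expected to have 'uri' and 'name' keys (assumed variable names).
 */
function renderCompetitors(string $subjectUri, array $rows): string
{
    // "about" sets the subject; the (made-up) prefix is declared on the same element.
    $html = '<ul xmlns:ex="http://example.org/vocab#" about="' . h($subjectUri) . '">' . "\n";
    foreach ($rows as $row) {
        // "rel" plus "href" yields: <subject> ex:competitor <row uri>
        $html .= '  <li><a rel="ex:competitor" href="' . h($row['uri']) . '">'
            . h($row['name']) . "</a></li>\n";
    }
    return $html . "</ul>\n";
}

$rows = [
    ['uri' => 'http://example.org/company/a', 'name' => 'Company A'],
    ['uri' => 'http://example.org/company/b', 'name' => 'Company B'],
];
$out = renderCompetitors('http://example.org/company/twitter', $rows);
echo $out;
```

The only change compared to a plain HTML list is the `about` and `rel` attributes, which is why this counts as a single hook rather than a separate export format.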

markup-to-SELECT-to-RDFa-to-RDF

I tweaked the wiki script which now works like this: When the page is saved, a first operation updates the wiki markup in the page's graph (i.e. the not-yet-populated template). In a second step, the page URL is retrieved via HTTP. This will return HTML with RDFa-encoded remote data, which is then parsed by ARC, and finally added to the same graph. We end up with a graph that contains not only the wiki markup, but also the RDFized information that was integrated from remote sites. After adding this graph to the RDF store, we can use a local query to generate the page and occasionally reset the graph to enable copy-by-reference. And all this without any custom API code.
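The save cycle can be sketched roughly as follows. This is not the actual wiki script: the store is replaced by a plain PHP array (graph URI mapped to a list of triples), and the HTTP fetch and the RDFa extraction are passed in as callables so the example runs without a network connection or a parser. All names and URIs are hypothetical.

```php
<?php
/**
 * Rebuild one page graph: wiki markup first, then the triples
 * extracted from the rendered page.
 *
 * $store        in-memory stand-in for a triple store: graph URI => triples
 * $fetchHtml    callable(string $url): string, e.g. an HTTP GET
 * $extractRdfa  callable(string $html): array of [s, p, o] triples
 */
function refreshPageGraph(
    array &$store,
    string $pageUrl,
    array $markupTriples,
    callable $fetchHtml,
    callable $extractRdfa
): int {
    // Step 1: reset the graph and store the not-yet-populated template markup.
    $store[$pageUrl] = $markupTriples;

    // Step 2: retrieve the rendered page, which carries RDFa-encoded remote data.
    $html = $fetchHtml($pageUrl);

    // Step 3: parse the RDFa and add the triples to the same graph.
    foreach ($extractRdfa($html) as $triple) {
        if (!in_array($triple, $store[$pageUrl], true)) {
            $store[$pageUrl][] = $triple;
        }
    }
    return count($store[$pageUrl]);
}

// Stubbed demo: no network access, no real RDFa parser.
$store = [];
$page = 'http://example.org/wiki/Twitter';
$markup = [[$page, 'ex:wikiMarkup', '"{competitors}"']];
$fetch = function (string $url): string {
    return '<a rel="ex:competitor" href="http://example.org/company/a">A</a>';
};
$extract = function (string $html) use ($page): array {
    return [[$page, 'ex:competitor', 'http://example.org/company/a']];
};

$count = refreshPageGraph($store, $page, $markup, $fetch, $extract);
echo "Graph now holds $count triples\n";
```

Resetting the graph in step 1 is what keeps the data copy-by-reference: each save throws away the previously integrated triples and pulls a fresh set from the remote source.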

rdfa-to-sparql

Back from New York "Semantic Web for PHP Developers" trip

Gave a talk and a workshop in NYC about SemWeb technologies for PHP developers
/me at times square

I'm back from New York, where I was given the great opportunity to talk about two of my favorite topics: Semantic Web Development with PHP, and (not necessarily semantic) Software Development using RDF Technology. I was especially looking forward to the second one, as that perspective is not only easier to understand for people from a software engineering context, but also because it is still a much neglected marketing "back-door": If RDF simplifies working with data in general (and it does), then we should not limit its use to semantic web apps. Broader data distribution and integration may naturally follow in a second or third step once people use the technology (so much for my contribution to Michael Hausenblas' list of RDF MalBest Practices ;)

The talk on Thursday at the NY Semantic Web Meetup was great fun. But the most impressive part of the event was the people there. A lot to learn from on this side of the pond. Not only very practical and professional, but also extremely positive and open. Almost felt like being invited to a family party.

The positive attitude was even true for the workshop, which I clearly could have made more effective. I didn't expect (but should have) that many people would come w/o a LAMP stack on their laptops, so we lost a lot of time setting up MAMP/LAMP/WAMP before we started hacking ARC, Trice, and SPARQL.

Marco brought up a number of illustrative use cases. He maintains an (unofficial, sorry, can't provide a pointer) RDF wrapper for any group on meetup.com, so the workshop participants could directly work with real data. We explored overlaps between different Meetup groups, the order in which people joined selected groups, inferred new triples from combined datasets via CONSTRUCT, and played with not-yet-standard SPARQL features like COUNT and LOAD.

And having done the workshop should finally give me the last kick to launch the Trice site now. The code is out, and it's apparently not too tricky to get started even when the documentation is still incomplete. Unfortunately, I have a strict "no more non-profits" directive, but I think Trice, despite being FOSS, will help me get some paid projects, so I'll squeeze an official launch in sometime soon-ish.

Below are the slides from the meetup. I added some screenshots, but they are probably still a bit boring without the actual demos (I think a video will be put up in a couple of days, though).

Could Microdata work better for me than RDFa?

Just had a quick look at the Microdata proposal, wondering about its pros and cons.
I've always had my little issues with RDFa, mainly for personal reasons. I'm repeating them here (for the last time, promised, don't want to trigger another flame war):
  • I personally don't like the number of new attributes and their names (about, resource, typeof, and property are at least as inconsistent as RDF/XML's tokens).
  • I've written an RDFa parser, but still don't really understand the processing model. RDFa does the job of course, and it's been specified by smart people I respect, but to me it just still feels a little too complicated. I often have to utilize an extraction service to verify the triples resulting from a snippet, and I've seen the creators of RDFa do the same.
    One reason for being less intuitive than hoped is the fact that adding an attribute to some existing snippet can easily change the entire meaning of nested information. This makes it tricky to incrementally add structure to already tested and approved RDFa (an unnoticed @rel or @typeof may add an unwanted blank intermediate node, for example, and you can have any combination of RDFa attributes on a single node).
  • I consider structured blogging a central use case for RDF in HTML, yet it's not fully supported by RDFa: RDFa does not allow sub-structures in XML Literals (for security/triple injection reasons, IIRC), so you can't extract a post body (including HTML markup) and also get the annotations encoded in the body (like reviews or events).
  • (Reliable) copy and paste is not possible when prefix definitions can be kept separate from annotations. This is relevant to some of the apps I'm working on, and it took me quite some time to admit that (intuitively desirable) URI abbreviations in HTML do have negative practical implications. It depends on the use case, but it also needs some experience to realize this, as the pro-prefix argument is practically motivated as well. (I started playing with RDF-ish copy & paste rather early, if that makes this conclusion more credible).
  • The xmlns:prefix mechanism doesn't work nicely with my development environment. This is perhaps a silly argument, but for me personally it is important to see that green little "0 errors" indicator in my browser while I'm creating sites. It was not hard to extend the Firefox validator extension with support for new attributes, but there was no clean way to make it accept xmlns:prefix. Spotting true errors in the dozens of RDFa-related complaints is annoying.

Having said that, if this little list is all I can come up with, then RDFa is probably a pretty solid and usable spec. I could easily write a list of things I find flawed in RDF/XML, or even SPARQL, my favorite RDF technology. And there is another good reason why I should tend towards using RDFa: Lack of proper alternatives. I still think it would be possible to create a cross-doctype solution. eRDF and my own poshRDF experiment show that it's possible, but so far these approaches are incomplete RDF-wise, and I wouldn't have the energy or funds to build a community to develop things further (and again, my arguments are motivated by personal use cases and habits, so there isn't a large overlap with other people's requirements anyway).

Nevertheless, the new "Microdata" proposal is currently being discussed, so it might be worth having a look and comparing it with my RDFa issue list above. I only had a quick scan, I may have gotten some details wrong:
  • It only introduces two new (mandatory) attributes: "item" and "itemprop". "item" can be used to type resources. RDFa's "about" can be re-used for URI-identified items. That sounds compact and neat so far.
  • "item" is mandatory to indicate the boundary of a resource description. This makes accidental triples much less likely to happen than with RDFa. For any "itemprop", you just have to walk up the DOM tree to find the container item, which makes both human- and code-based parsing easy.
  • Structured blogging? Aww, not really. While you can at least choose between raw markup or structured values in RDFa, Microdata only supports flat key-value pairs where the value is a node's textContent and won't contain tags (if I read the draft correctly). I don't really need datatypes and languages, but I definitely want RDF triples where the object can contain HTML markup (wiki blobs with embedded annotations are another example).
  • Copy & paste of source code or from/to contenteditable sections is more reliable than with RDFa because there is no prefix mechanism.
  • It'd be possible to make the Firefox validator eat the new Microdata attributes without complaining, but I'm not sure how likely it is to have Microdata support in the official distribution anytime soon. Marc Gueury writes that validating HTML5 may require a new sort of validator, so switching to HTML5 may make things worse instead of better for me, development-wise.
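The "walk up the DOM tree" lookup from the list above can be sketched in a few lines of PHP with the built-in DOM extension. The attribute names "item" and "itemprop" are taken from the draft as described here; the markup and the item type value are made up, and the draft's actual extraction rules may well differ in detail.

```php
<?php
/** Walk up the DOM until an element carrying an "item" attribute is found. */
function findContainerItem(DOMElement $node): ?DOMElement
{
    for ($p = $node->parentNode; $p instanceof DOMElement; $p = $p->parentNode) {
        if ($p->hasAttribute('item')) {
            return $p;
        }
    }
    return null; // no enclosing item
}

// Made-up markup with a hypothetical item type.
$html = '<div item="org.example.event">'
      . '<h1 itemprop="title">SemWeb Meetup</h1>'
      . '<p>Where: <span itemprop="location">New York</span></p>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect [item type, property name, text value] for every itemprop node.
$pairs = [];
foreach ($xpath->query('//*[@itemprop]') as $prop) {
    $item = findContainerItem($prop);
    $type = $item ? $item->getAttribute('item') : '(none)';
    $pairs[] = [$type, $prop->getAttribute('itemprop'), trim($prop->textContent)];
}
print_r($pairs);
```

Note that the value is taken from `textContent`, so any markup inside a property node is flattened, which is exactly the structured blogging limitation mentioned in the list.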

I recently watched a short section of a TV fortune-teller show where desperate people could dial in to get their questions answered. The lady who called asked "Will I find a new love?", and the fortune-teller looked into her cards (very slowly, of course, given the 3 EUR/minute rate), then slowly lifted her head, looked straight into the camera and articulated her findings: "I see a definite Maybe."

I guess this awesome universal answer also works for my opening question. There simply is no ideal solution. I like the item/itemprop idea, but I'd need to add a hack for markup values (e.g. by adding an item="...XMLLiteral" container and then converting these items to XML nodes. But then I can just add a simpler hack to my RDFa extractor to deep-parse XMLLiterals). This doesn't justify a whole new spec. The copy/paste problem is not too urgent any more, as Linked Data enables nifty copy-by-reference instead of copy-by-value.

It's generally a little surprising to see that Microdata proposal. For months, the HTML5 opinion makers argued against user-defined markup structures, and now they have created a completely new spec that not only extends RDFa's possibilities to identify resource types and relations, but also seems to introduce a redundant serialization for selected microformats.

Anyway, for the sake of convergence and less work, I think I still prefer (a subset of) RDFa, if only there was a way to get rid of CURIEs (who wants an abbreviation mechanism whose acronym can't even be properly expanded? ;). And an alternative for the validation pain could be a simple, locally installed validator, accessible through a Ubiquity script. When I think about it, I mainly just need well-formedness and some attribute checks. A Ubiquity script could directly show HTML errors and also extracted triples, and maybe even do some triple sanity checks, too. But then this setup would work for Microdata just as well. Ah well..

Paggr screencast: Conference Explorer (proto)

Prototype screencast of a semantic conference explorer for ESWC 2009.
I just returned from a short, doc-enforced trip to Nice (awesome place, savoir-vivre and all that) and will fly to the NYC SemWeb Meetup in a few days. Before we went to France, I created another Paggr screencast. This one is the first to show the (user-facing) dashboard and widgets we plan to make available as a semantic conference explorer at ESWC 2009. Still some way to go, but I'm optimistic that we'll have a number of handy helpers online by the beginning of the event. I won't be able to attend in person, so I'm highly motivated to have at least a twitter and twitpic tracker up and running then.



HQ version (quicktime, 134MB)

ARC Graph Gear Serializer Plugin

Patrick Murray-John created an ARC2 converter for Graph Gear visualizations
Patrick Murray-John (who is currently Semantifying the University of Mary Washington) just released a first version of an ARC2 converter for Graph Gear visualizations. Looks pretty cool.
Graph Gear visualization from RDF via ARC

RDF/SPARQL-based web development for PHP coders: Meetup presentation and workshop in NYC

I'll give a talk and run a workshop in New York City in May.
The Linked Data meme is spreading and we have strong indications that web developers who understand and know how to apply practical semantic web technologies will soon be in high demand. Not only in enterprise settings but increasingly for mainstream and agency-level projects where scripting languages like PHP are traditionally very popular.

I can't really afford travelling to promote the interesting possibilities around RDF and SPARQL for PHP coders, so I'm more than happy that Meetup master Marco Neumann invited me to come over to New York and give a talk at the Meetup on May 21st. Expect a fun mixture of "Getting started" hints, demos, and lessons learned. In order to make this trip possible, Marco is organizing a half-day workshop on May 22nd, where PHP developers will get a hands-on introduction to essential SemWeb technologies. I'm really looking forward to it (and big thanks to Marco).

So, if you are a PHP developer wondering about the possibilities of RDF, Linked Data & Co, come to the Meetup, and if you also want to get your hands dirty (or just help me pay the flight ticket ;) the workshop could be something for you, too. I'll arrive a few days earlier, by the way, in case you want to add another quaff:drankBeerWith triple to your FOAF file ;)

Paggr article in Nodalities Magazine 6

The latest NodMag issue features an article about Paggr.
Talis' new Nodalities Mag is now available online (and the print version is on its way to subscribers). This issue contains six semantic web articles, including one about Paggr:
  • Linking Data and Semantics at O'Reilly - Gavin Carothers and Charles Greer tell O'Reilly Media's Linked Data story.
  • Discovering SPARQL - Alex Tucker exposes SPARQL endpoints via Bonjour.
  • Linked Data In(ter)Action - Benjamin Nowack discusses Paggr.
  • Introducing: STI International
  • Social Semantic Web Scales in the Cloud - Simon Schenk discusses SemaPlorer
  • Streams, Pools and Reservoirs - Leigh Dodds explores flowing data

Semantic web apps to simplify my life

A wish list for the semantic web
Heh, quick update after heated discussions on IRC: I know that there are non-RDF apps as well as RDF apps for each of the items below. What I actually want, however, are solutions that look and feel like modern Web apps (hence "simple and beautiful"), but still provide things such as RDF data exchange and SPARQL access. And these apps don't really exist yet. I admit that it's apparently a real challenge for us RDFers to build them, due to our inner-platform tendencies, but I hope that we'll get there once we realize that we can combine our agile, generic backends with task-optimized front-ends.

Update 2: Have a look at loomp.org. These guys are doing great stuff following a "Simplicity is key" approach.


A short list of apps that I'd love to see for a more streamlined life/workflow:

A simple, beautiful, semwebby, linked data-enabled ...
  • ... feed reader
  • ... issue tracker / todo app (one setup for all my projects)
  • ... wiki (for notes, ideas, structured data)
  • ... address book
  • ... calendar
  • ... email inbox (with a bot that removes junk based on SPARQL rules)
  • ... lifelog (private posts, project posts, status updates, location changes)
  • ... online profile generated from all my data
  • ... browser-based system to explore and display the integrated information from my data apps
  • ... alert tool for selected topics/discussions on Twitter, IRC, and mailing lists
  • ... photo organizer

Some of my development work is probably in line with this roadmap, but until now I was more in the "Breadth-first" camp, often moving to the next interesting exercise once I had an initial proof of concept. Switching to "Depth-first" could already simplify my life a lot. Focusing on a smaller number of projects would not only cut down the number of low-activity projects and parallel todo items, but should also allow me to release more stable and market-ready products in less time.

ARC now also GPL-licensed

ARC is now available under the W3C Software or the GPL license
Arto Bendiken and Stéphane Corlosquet asked me to provide ARC also under the GPL (for Drupal, in addition to the current W3C Software License), so here you are.

ARC is already used by several modules that help turn Drupal into an RDF-powered CMS, for example the RDF API, the SPARQL extension, or the Calais module. The new license will make it easier for the Drupal community to directly bundle ARC with their RDF extensions. I guess that Drupal will have its own complete RDF toolkit one day, but it's great to see ARC being utilized for accelerating the development progress.

Paggr screencast: Linked Data Widget Builder

A screencast about Paggr's sparqlet builder.
Running an R&D-heavy agency in the current economic climate is pretty tough, but there are also a couple of new opportunities for these semantic solutions that help reduce costs and do things more efficiently. I'm finally starting to get project requests that include some form of compensation. Not much yet (all budgets seem to be very tight these days), but it's a start, and together with support from Susanne, I could now continue working on Paggr, semsol's Netvibes-like dashboard system for the growing web of Linked Data.

An article about Paggr will be in the next Nodalities Magazine, and the ESWC2009 technologies team is considering a custom system for attendees which is a great chance to maybe get other conference organizers interested. (I see much potential in a white-label offering, but a more mainstream-ish version for Web 2.0 data is still on my mind. Just have to focus on getting self-sustained first.)

Below is a short screencast that demonstrates a first version of the sparqlet (= semantic widget) builder. I've de-coupled sparqlet-serving from the dashboard system, so that I'll be able to open-source the infrastructure parts of Paggr more easily. Another change from the October prototype is the theme-ability of both dashboards and widget servers. Lots of sun, sky, and sea for ESWC ;-)



HQ version (quicktime, 120MB)

New ARC version (DB changes!)

ARC revision 2009-03-04 introduces some low-level changes.
I've just uploaded a new ARC revision (rev 2009-03-04). In preparation of a Store Optimizer (which will improve the RDF store's scalability and performance), I slightly changed the underlying MySQL table structure. Please back up your data before you upgrade. Here's sample code, it's not too complicated:
/* old ARC version */
$store->createBackup($backup_path); // make sure the directory is write-enabled
// on success: $store->drop();
/* new ARC */
$store->query('LOAD <' . $backup_path . '>');
(Note: Store settings, if used, are not part of the dump and have to be manually copied over)

Main changes in this version:
  • the (crappy) inferencer was removed, I'll add a rule-based system at some later stage
  • the store indexes can be defined via a "store_indexes" config option now. Default: array('sp (s,p)', 'os (o,s)', 'po (p,o)'). The previous stores had an additional index 'spo (s,p,o)'.
  • The store got an "extendColumns" method which changes the column types from MEDIUMINT to INT. You don't have to call this method explicitly, the tables will be auto-upgraded should your store reach ARC's previous 16M triples limit.
  • There is a new "store_write_buffer" config option (default: 2500, changed from 5000). This option lets you set the batch size of triples written to the MySQL tables. In certain situations, esp. with shared hosts or large literal objects, the "5000" was too much and led to MySQL rejecting the queries.
  • the toTurtle and toRDFXML methods (and associated methods in the Serializers) accept a 2nd "raw" parameter now, in case you don't want a full RDF document, but just the triples. (thx to Claudia Wagner for the suggestion)
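A minimal sketch of how the two new options might appear in a store configuration. The "store_indexes" and "store_write_buffer" names and defaults are the ones listed above; the connection keys and all values are placeholders and should be treated as assumptions. The batching at the end is only an illustration of what a write buffer does, not ARC's internal code.

```php
<?php
// Connection values are placeholders; the db_* and store_name keys are assumptions.
$config = [
    'db_host' => 'localhost',
    'db_name' => 'my_db',
    'db_user' => 'user',
    'db_pwd' => 'secret',
    'store_name' => 'my_store',
    // Options described in the list above:
    'store_indexes' => ['sp (s,p)', 'os (o,s)', 'po (p,o)'], // the new default set
    'store_write_buffer' => 1000, // smaller than the 2500 default, e.g. for a shared host
];

// Illustration only: how a write buffer splits a load into batches.
$triples = array_fill(0, 6000, ['s', 'p', 'o']);
$batches = array_chunk($triples, $config['store_write_buffer']);

printf("%d triples -> %d write batches\n", count($triples), count($batches));
```

The array would then be handed to ARC when the store is instantiated (via `ARC2::getStore($config)`, if I remember the factory call correctly). A smaller buffer means more, but smaller, INSERT queries, which is the trade-off on hosts that reject large ones.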

If you have questions, just send them to the mailing list and I'll try to help.

Semantics @ SIMsKultur Online

SIMsKultur Online is adding semantics
Exciting times, it really looks like we are about to witness RDF's tipping point. Every other week we see another service adding semantic web support. I didn't even find time to play with O'Reilly's RDF data yet, and yesterday I already came across the next site: SIMsKultur not only added RDF export for all events (more info at evo42), but also put up a hacked smesher instance to enrich and filter their Tweets (work in progress). I've been told that even SPARQL support is on their list.

smesher @ SIMsKultur

This is exactly the stuff I was dreaming of when I started with RDF development: Web agencies enhancing their customers' experience with easy-to-deploy solutions. I didn't expect it to become such a marathon, and we're still not fully there yet, but it feels a lot like we're finally hitting the home stretch :-)