finally a bnode with a uri

Posts tagged with: rdf

Schema.org - Threat or Opportunity?

Some thoughts about the impact of Schema.org
I only wanted to track SemTech chatter but it seems all semantics-related tweet streams are discussing just one thing right now: Schema.org. So I apparently will have to build a #semtech filtering app, but I couldn't resist and had a close look at Schema.org, too. And just like everybody else, I'll join the fun of polluting the web with yet another opinion about its potential impact on the Semantic Web initiative and related efforts.

What exactly is Schema.org?

  • It is a list of instructions for adding structured data to HTML pages.
  • Webmasters can choose from a long, but finite list of types and properties.
  • Data-enhanced web pages trigger richer displays in Google/Bing/Yahoo search result pages.

Why the uproar?

  • Schema.org proposes the use of Microdata, a rather new format for embedding structured data in HTML that was not developed by the RDF community.
  • Schema.org introduces a new vocabulary which doesn't re-use terms from existing RDF schemas.

Who can benefit from it?

  • The web, because the simple template-like instructions on schema.org will boost the amount of structured data, similar to Facebook's Open Graph Protocol.
  • The semantic web market, by offering complementary as well as extending/competing solutions.
  • SEO people, because they can offer their service with less effort.
  • Website owners, who can more reliably customize their search engine displays and increase CTRs.
  • Possibly HTML5 (doctype) deployment, because the supported structures are based on HTML5's Microdata.
  • Verticals around popular topics (Music, Food, ...) because the format shakeout will make their parser writers' lives easier.
  • Verticals who manage to successfully establish a schema.org extension (e.g. Job Offers).
  • The search engine companies involved, because extracting (known) structures can be less expensive and more accurate than NLP and statistical analysis. Controlling the vocabulary also means being able to tailor it to semantic advertising needs; integrating the schema.org taxonomy with AdWords would make a lot of (business) sense. And finally, the search engines can more easily generate their own verticals now (as Google has already successfully done with shopping and recipe browsers), making it harder for specialized aggregators to gain market share.
  • Spammers, unless the search engines manage to integrate the structured markup with their existing stats-based anti-spam algorithms.

Who might be threatened and how could they respond?

  • Microformats and overlapping RDF vocabularies such as FOAF (unlikely) or GoodRelations, which Schema.org already calls "earlier work". Even if they continue to be supported for the time being, implementers will switch to schema.org vocabulary terms. One opportunity for RDF schema providers lies in grounding their terms in the schema.org taxonomy and highlighting use cases beyond the simple SEO/Ad objectives of Schema.org. RDF vocabs excel in the long tail, and there are many opportunities left (especially for non-motorcycle businesses ;-). This will only work out, though, if applications that utilize these advanced data structures finally appear. If the main consumers continue to be search engines, there is little incentive to invest in higher granularity.
  • The RDFa community. They think they are under attack here, and I wonder if Manu is perhaps overreacting. Hey, if they had listened to me they wouldn't have this problem now, but they had several reasons to stick to their approach, and I don't think those arguments simply get wiped away by Schema.org. They may have to spend some energy now on keeping Facebook on board, but there are enough other RDFa adopters that they shouldn't be worried too much. And, like the RDF vocab providers, they should highlight use cases beyond SEO. The good news is that potential spam problems, which are more likely to occur in the SEO context, will now get associated with Microdata, not RDFa. And the Schema.org graph can be manipulated by any site owner, while Facebook's interest graph is built by authenticated users. Maybe the RDFa community shouldn't have taken the SEO train in the first place anyway. Now Schema.org has simply stolen the steam. After all, one possible future of the semantic web was to creatively destroy centralized search engines, not to suck up to them. So maybe Schema.org can be interpreted as a kick in the back to get back on track.
  • The general RDF community, but unnecessarily so. RDFers kicked off a global movement which they can be proud of, but they will have to accept that they no longer dictate what the semantic web is going to look like. Schema.org seems to be a syntax fight, but Microdata maps nicely to RDF, which RDFers often ignore (that's why schema.rdfs.org was so easy to set up; see the small example after this list). The real wakeup call is less obvious. I'm sure that until now, many RDFers didn't notice that a core RDF principle is dying. RDFers used to think that distinct identifiers for pages and their topics are needed. This assumption was already proved wrong when Facebook started their page-based OGP effort. Now, with Schema.org's canonical URLs, we have a second, independent group that is building a semantic web truly layered on top of the existing web, without identifier mirrors (and so far without causing any URI identity crisis). This evolving semantic web is closer to the existing web than the current linked data layer, and probably even more compatible with OWL, too. There is a lot we can learn. Instead of becoming protective, the RDF community should adapt and simplify their offerings if they want to keep their niches relevant. Luckily, this is already happening, as e.g. the Linked Data API demonstrates. And I'm very happy to see Ivan Herman increasingly speaking/writing about the need to finally connect web developers with the semantic web community.
  • Early adopters in the CMS market. Projects like Drupal and IKS have put non-trivial resources into integrating machine-readable markup, and most of them are using RDFa. Microdata, in my experience, is easier to tame in a CMS than RDFa, especially when it comes to JavaScript operations. But whether semantic CMSs should add support for (or switch to) Schema.org microdata and their vocabulary depends more on whether they want/need to utilize SEO as a (short-term) selling proposition. Again, this will also depend on application developer demands.
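
To illustrate the Microdata-to-RDF mapping mentioned above (a simplified example; the exact subject and property URI generation rules differ slightly between the various Microdata-to-RDF proposals):

  <div itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Alice</span>
  </div>

  <!-- roughly yields the RDF (in Turtle):
    _:item a <http://schema.org/Person> ;
           <http://schema.org/name> "Alice" .
  -->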

What about Facebook?

Probably the more interesting aspect of this story: what will Facebook do? Their interest graph combined with linked data has big potential, not only for semantic advertising. And Facebook is interested in getting as many of their hooks into websites as possible. Switching to Microdata and/or aligning their types with Schema.org's vocabulary could make sense. Webmasters would probably welcome such a consolidation step as well. On the other hand, Facebook is known for wanting to keep things under their own control, too, so the chance of them adopting Schema.org and Microdata is rather low. This could well turn into an RSS déjà vu, with a small set of formats (OGP-RDFa, full RDFa, Schema.org-Microdata, full Microdata) fighting for publisher and developer attention.

Conclusion

I am glad that Microdata finally gets some deserved attention and that someone acknowledged the need for a format that is easy to write and to consume. Yes, we'll get another wave of "see, RDF is too complicated" discussions, but we should be used to them by now. I expect RDF toolkits to simply integrate Microdata parsers soon-ish (if we're good at one thing then it's writing parsers), and the Linked Data community gets just another taxonomy to link to. Schema.org owns the SEO use case now, but it's also a nice starting point for our more distributed vision. The semantic web vision is bigger than data formats and it's definitely bigger than SEO. The enterprise market which RDF has mainly been targeting recently is a whole different beast anyway. No kittens killed. Now go build some apps, please ;-)

Linked Data Entity Extraction with Zemanta and OpenCalais

A comparison of the NER APIs by Zemanta and OpenCalais.
I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • OpenCalais gives you more free API calls out of the box than Zemanta (50,000 vs. 1,000 per day). You can get a free upgrade to 10,000 Zemanta calls via a simple email, though (or excessive API use; Andraž auto-upgraded my API limit when he noticed my crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      // qs() builds the query string from $args; substr() drops the leading separator
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      // read the HTTP response stream in chunks
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of metadata such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF; a rough end-to-end ARC2 sketch follows after the query. As you can see, the graph structure isn't trivial, but still understandable:
    PREFIX z: <http://s.zemanta.com/ns#>
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    
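  • For context, here is roughly how those pieces fit together with ARC2 (a minimal sketch, not from the original post; the store settings are placeholders, the z: namespace URI is an assumption, and $api_response / $entity_query stand for the API result and the query shown above):
    /* set up an ARC2 triple store (placeholder credentials) */
    $config = array(
      'db_host' => 'localhost', 'db_name' => 'blogdb',
      'db_user' => 'user', 'db_pwd' => 'pass',
      'store_name' => 'entities',
    );
    $store = ARC2::getStore($config);
    if (!$store->isSetUp()) $store->setUp();
    
    /* parse the RDF/XML API response and save it as named graph $g */
    $parser = ARC2::getRDFXMLParser();
    $parser->parse($g, $api_response); /* $g doubles as the base URI */
    $store->insert($parser->getTriples(), $g);
    
    /* run the SELECT query shown above */
    $rows = $store->query($entity_query, 'rows');
    foreach ($rows as $row) {
      echo $row['name'] . ' => ' . $row['id'] . "\n";
    }
    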

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags (a query sketch follows at the end of this list).
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.
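  • For comparison, a SocialTags query against the OpenCalais RDF could look roughly like the sketch below. The c: prefix is the one from the paramsXML above; the SocialTag type URI is reproduced from memory, so double-check it against an actual API response:
    PREFIX c: <http://s.opencalais.com/1/pred/>
    SELECT DISTINCT ?tag ?name
    FROM <' . $g . '> WHERE {
      ?tag a <http://s.opencalais.com/1/type/tag/SocialTag> ;
           c:name ?name .
    } ORDER BY ?name
    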

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Zemanta responses, in contrast, do not (yet, Andraž told me they are working on it) contain entity types at all. You always need an additional request to retrieve type information (unless you are doing nasty URI inspection, which is what I did with detected URIs from Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBpedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Could having two RDF-in-HTMLs actually be handy?

A combination of RDFa and Microdata would allow for separate semantic layers.
Apart from grumpy rants about the complexity of W3C's RDF specs and semantic rich-text editing excitement, I haven't blogged or tweeted a lot recently. That's partly because there finally is increased demand for the stuff I'm doing at semsol (agency-style SemWeb development), but also because I've been working hard on getting my tools into a state where they feel more like typical Web frameworks and apps. Talis' FanHubz is an example where (I think) we found a good balance between powerful RDF capabilities (data re-purposing, remote models, data augmentation, a crazy army of inference bots) and a non-technical UI (simplistic visual browser, Twitter-based annotation interfaces).

Another example is something I've been working on during the last months: I somehow managed to combine essential parts of Paggr (a drag&drop portal system based on RDF- and SPARQL-based widgets) with an RDF CMS (I'm currently looking for pilot projects). And although I decided to switch entirely to Microdata for semantic markup after exploring it during the FanHubz project, I wonder if there might be room for two separate semantic layers in this sort of widget-based website. Here is why:

As mentioned, I've taken a widget-like approach for the CMS. Each page section is a resource of its own that can be defined and extended by the web developer, styled by themers, and re-arranged and configured by the webmaster. In the RDF CMS context, widgets can easily integrate remote data, and when the integrated information is exposed as machine-readable data in the front-end, we can get beyond the "just-visual" integration of current widget pages and bring truly connectable and reusable information to the user interface.

Ideally, both the widgets' structural data and the content can be re-purposed by other apps. Just like in the early days of the Web, we could re-introduce a copy & paste culture of things for people to include in their own sites. With the difference that RDF simplifies copy-by-reference and source attribution. And both developers and end-users could be part of the game this time.

Anyway, one technical issue I encountered arises when a page contains multiple page items but describes a single resource. With a single markup layer (say Microdata), you get a single tree where the context of the hierarchy is constantly switching between structural elements and content items (page structure -> main content -> page layout -> widget structure -> widget content). If you want to describe a single resource, you have to repeatedly re-introduce the triple subject ("this is about the page structure", "this is about the main page topic"). The first screenshot below shows the different (grey) widget areas in the editing view of the CMS. In the second screenshot, you can see that the displayed information (the marked calendar date, the flyer image, and the description) in the main area and the sidebar is about a single resource (an event).

Trice CMS Editor
Trice CMS editing view

Trice CMS Editor
Trice CMS page view with inline widgets describing one resource

If I used two separate semantic layers, e.g. RDFa for the content (the event description) and Microdata for the structural elements (column widths, widget template URIs, widget instance URIs), I could describe the resource and the structure without repeating the event subject in each page item.
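
A hypothetical sketch of such a split (the widget vocabulary, URIs, and property names below are invented for illustration):

  <!-- structural layer (Microdata): describes the widget, not the event -->
  <div itemscope itemtype="http://example.com/ns/trice#Widget"
       itemid="http://example.com/widgets/event-calendar-1">
    <meta itemprop="column" content="sidebar" />
    <!-- content layer (RDFa): inherits the page's subject (the event), so
         the event URI doesn't have to be repeated in each widget; the
         "ical" prefix is assumed to be declared on the page root -->
    <span property="ical:dtstart" content="20110621">June 21</span>
  </div>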

To be honest, I'm not sure yet if this is really a problem, but I thought writing it down could kick off some thought processes (which now tend towards "No"). Keeping triples as stand-alone-ish as possible may actually be an advantage (even if subject URIs have to be repeated). No semantic markup solution so far provides full containment for reliable copy & paste, but explicit subjects (or "itemid"s in Microdata-speak) could bring us a little closer.

Conclusions? Err.., none yet. But hey, did you see the cool CMS screenshots?

Naming Properties and Relations (comment)

A local comment to JeniT's post about predicate names
I wasn't able to add a comment to Jeni's interesting post about RDF predicate names (markdown-related, my fault), so I'll quickly post it here, as I'm pondering similar things, too.

In her post, Jeni explores the issues around naming RDF terms; the community has gathered a couple of experiences and suggestions in the last years.
I personally find "role-noun" predicates easier to support in RDF apps than the older hasPropertyOf pattern (now often considered an anti-pattern). And inverse properties are just painful, as they usually require some form of inference to streamline the user experience.

Not sure if that's helpful information, but for a project around semantic note-taking/logging, I played with different notations that users might be comfortable with for entering factoids via an unstructured input form (à la Twitter). I could identify the following patterns that still seemed acceptable (as shared/supported syntax). All of them can be implemented using role-noun predicates (assuming that predicate labels are similar to the predicate names):
  • SUBJECT'(s)? PREDICATE (:|is) OBJECT
  • OBJECT is SUBJECT'(s)? PREDICATE
  • OBJECT is (the)? PREDICATE of SUBJECT
  • SUBJECT has PREDICATE (:)? OBJECT
  • (the|a)? PREDICATE(s)? of SUBJECT (is|are) OBJECT ((,|and|&) OBJECT)*
(There are more patterns, for things like tagging and typing, but the examples above are the predicate-related grammar rules).

As soon as you add (has|is|of) to one PREDICATE, you get problems with the other notations, so role-noun seems to be a good fit.
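
As an illustration, a naive matcher for the colon variant of the "has" pattern could look like the following sketch (hypothetical code; the factoid grammar and the term-minting rule are made up for this example):

  /* match e.g. "Berlin has population: 3400000" */
  if (preg_match('/^(.+?)\s+has\s+(.+?)\s*:\s*(.+)$/', $factoid, $m)) {
    list(, $s, $p, $o) = $m;
    /* mint a camelCased role-noun predicate URI from the free-text label */
    $pred = 'http://example.com/terms#' . lcfirst(str_replace(' ', '', ucwords($p)));
    $triples[] = array('s' => $s, 'p' => $pred, 'o' => $o);
  }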

Unfortunately, one (non-trivial) problem remains: People (and Web 2.0 apps) also like 'SUBJECT PREDICATE_VERB OBJECT' (e.g. "likes", "bookmarked", "said", "posted", "is listening to" ...) and I don't have a proper idea of how to handle those automatically yet, other than hard-coding support for the typical social media verbs. It could be possible to use WordNet to detect verbs and derive a canonicalized form, and then model those patterns as activities (activity = liking, bookmarking, saying, posting, listening, plus ACTIVITY_PERSON and ACTIVITY_TARGET or somesuch). If anyone has a suggestion, I'd be happy to hear it.

Back from New York "Semantic Web for PHP Developers" trip

Gave a talk and a workshop in NYC about SemWeb technologies for PHP developers
/me at times square
I'm back from New York, where I was given the great opportunity to talk about two of my favorite topics: Semantic Web Development with PHP, and (not necessarily semantic) Software Development using RDF Technology. I was especially looking forward to the second one, as that perspective is not only easier to understand for people from a software engineering context, but also because it is still a much neglected marketing "back-door": If RDF simplifies working with data in general (and it does), then we should not limit its use to semantic web apps. Broader data distribution and integration may naturally follow in a second or third step once people use the technology (so much for my contribution to Michael Hausenblas' list of RDF Mal-/Best Practices ;)

The talk on Thursday at the NY Semantic Web Meetup was great fun. But the most impressive part of the event were the people there. A lot to learn from on this side of the pond. Not only very practical and professional, but also extremely positive and open. Almost felt like being invited to a family party.

The positive attitude was even true for the workshop, which I clearly could have made more effective. I didn't expect (but should have) that so many people would come without a LAMP stack on their laptops, so we lost a lot of time setting up MAMP/LAMP/WAMP before we started hacking ARC, Trice, and SPARQL.

Marco brought up a number of illustrative use cases. He maintains an (unofficial, sorry, can't provide a pointer) RDF wrapper for any group on meetup.com, so the workshop participants could directly work with real data. We explored overlaps between different Meetup groups, the order in which people joined selected groups, inferred new triples from combined datasets via CONSTRUCT, and played with not-yet-standard SPARQL features like COUNT and LOAD.

And having done the workshop should finally give me the last kick to launch the Trice site now. The code is out, and it's apparently not too tricky to get started even when the documentation is still incomplete. Unfortunately, I have a strict "no more non-profits" directive, but I think Trice, despite being FOSS, will help me get some paid projects, so I'll squeeze an official launch in sometime soon-ish.

Below are the slides from the meetup. I added some screenshots, but they are probably still a bit boring without the actual demos (I think a video will be put up in a couple of days, though).

ARC Graph Gear Serializer Plugin

Patrick Murray-John created an ARC2 converter for Graph Gear visualizations
Patrick Murray-John (who is currently Semantifying the University of Mary Washington) just released a first version of an ARC2 converter for Graph Gear visualizations. Looks pretty cool.
Graph Gear visualization from RDF via ARC

RDF/SPARQL-based web development for PHP coders: Meetup presentation and workshop in NYC

I'll give a talk and run a workshop in New York City in May.
The Linked Data meme is spreading and we have strong indications that web developers who understand and know how to apply practical semantic web technologies will soon be in high demand. Not only in enterprise settings but increasingly for mainstream and agency-level projects where scripting languages like PHP are traditionally very popular.

I can't really afford travelling to promote the interesting possibilities around RDF and SPARQL for PHP coders, so I'm more than happy that Meetup master Marco Neumann invited me over to New York to give a talk at the Meetup on May 21st. Expect a fun mixture of "Getting started" hints, demos, and lessons learned. In order to make this trip possible, Marco is organizing a half-day workshop on May 22nd, where PHP developers will get a hands-on introduction to essential SemWeb technologies. I'm really looking forward to it (and big thanks to Marco).

So, if you are a PHP developer wondering about the possibilities of RDF, Linked Data & Co, come to the Meetup, and if you also want to get your hands dirty (or just help me pay the flight ticket ;) the workshop could be something for you, too. I'll arrive a few days earlier, by the way, in case you want to add another quaff:drankBeerWith triple to your FOAF file ;)

ARC now also GPL-licensed

ARC is now available under the W3C Software License or the GPL
Arto Bendiken and Stéphane Corlosquet asked me to provide ARC also under the GPL (for Drupal, in addition to the current W3C Software License), so here you are.

ARC is already used by several modules that help turn Drupal into an RDF-powered CMS, for example the RDF API, the SPARQL extension, or the Calais module. The new license will make it easier for the Drupal community to directly bundle ARC with their RDF extensions. I guess that Drupal will have its own complete RDF toolkit one day, but it's great to see ARC being utilized for accelerating the development progress.

Paggr screencast: Linked Data Widget Builder

A screencast about Paggr's sparqlet builder.
Running an R&D-heavy agency in the current economic climate is pretty tough, but there are also a couple of new opportunities for semantic solutions that help reduce costs and do things more efficiently. I'm finally starting to get project requests that include some form of compensation. Not much yet (all budgets seem to be very tight these days), but it's a start, and together with support from Susanne, I could now continue working on Paggr, semsol's Netvibes-like dashboard system for the growing web of Linked Data.

An article about Paggr will be in the next Nodalities Magazine, and the ESWC2009 technologies team is considering a custom system for attendees, which is a great chance to maybe get other conference organizers interested. (I see much potential in a white-label offering, but a more mainstream-ish version for Web 2.0 data is still on my mind. Just have to focus on getting self-sustained first.)

Below is a short screencast that demonstrates a first version of the sparqlet (= semantic widget) builder. I've de-coupled sparqlet-serving from the dashboard system, so that I'll be able to open-source the infrastructure parts of Paggr more easily. Another change from the October prototype is the theme-ability of both dashboards and widget servers. Lots of sun, sky, and sea for ESWC ;-)



HQ version (quicktime, 120MB)

Semantics @ SIMsKultur Online

SIMsKultur Online is adding semantics
Exciting times: it really looks like we are about to witness RDF's tipping point. Every other week we see another service adding semantic web support. I haven't even found time to play with O'Reilly's RDF data yet, and yesterday I already came across the next site: SIMsKultur not only added RDF export for all events (more info at evo42), but also put up a hacked smesher instance to enrich and filter their Tweets (work in progress). I've been told that even SPARQL support is on their list.

smesher @ SIMsKultur

This is exactly the stuff I was dreaming of when I started with RDF development: Web agencies enhancing their customers' experience with easy-to-deploy solutions. I didn't expect it to become such a marathon, and we're still not fully there yet, but it feels a lot like we're finally hitting the home stretch :-)

Quick thoughts on semantic microblogging

Motivation and wish list for a personal semantic microblogging system
This week, the first "Microblogging Conference Europe" will take place in Hamburg. I was lucky to get a late ticket (thanks to Dirk Olbertz, who won't be able to make it). The conference will have barcamp-style tracks, and (narrow-minded as I am) I started thinking about adding SemWeb power to microblogging.

The more I use Twitter and advanced clients like TweetDeck, the more I think that (slightly enhanced) microblogs could become great interfaces to the (personalized) Semantic Web. I'm already noticing that I don't use a feed reader or delicious to discover relevant content any more. I'm effectively saving time. But simultaneously it becomes obvious that Twitter can be a distracting productivity killer. So, here is the idea: Take all the good things from microblogging and add enough semantics to increase productivity again. And while at it, utilize this semantic microblog as a work/life/idea log.

A semantic microblog would simplify the creation of structured, machine-readable information, in part for personal use, and generally to let the computer take care of certain tasks or do things that I haven't thought of yet.

I have only two days left to prepare a demo and a talk, so I better start developing. I'll keep the rest of this post short and log my progress on Twitter instead. The app will be called "smesher". I'm starting now (or rather tomorrow morning, have to leave in 15 mins).

Use cases

  • How much time did I spend doing support this month?
  • Who are my real contacts (evidence-driven, please, why do I have to manually add followees)?
  • Show me a complete history of activities related to project A
  • How much can I bill client B? (or even better: Generate an invoice for client B)
  • What was that great Tapas Bar we went to last summer again?
  • Where did I first meet C?
  • Bookmarks ranked by number of occurrences in other tweets
  • Show me all my blog posts about topic D
  • ...

Microblogs: Strengths

  • Microblogs are web-based
  • Microblogs are very easy to use ("less is more")
  • Microblogs offer a great communication channel (asynchronous, but almost instant)
  • Microblog clients are getting ubiquitous
  • Microblogs can be used as life logs
  • Microblogs can be used for note taking
  • Microblogs can be used for bookmarking
  • Microblogs can be used for announcements
  • Microblogs can accelerate software development (near-real-time feedback loop)
  • Microblog search (and the associated feeds) can be used to track interests
  • Hashtags are a simple way to annotate posts
  • A Microblog can be used as an interface to bots

Some Requirements and Nice-to-haves for semantic microblogging

  • access to a post's default information (author, title, date, source)
  • support for evolving patterns (@-recipients, people mentioned, URLs mentioned, hashtags, Re-Tweets)
  • groups, or at least private notes (some posts just don't need to be on the public timeline ;)
  • complete archives
  • perhaps semantic auto-tagging
  • post-publication tags (I'll surely forget a necessary tag every now and then)
  • private tags?
  • keep the simple UI (no checkbox overload etc.)
  • support for machine tags or a similar grassroots extensibility mechanism to increase granularity without losing usability/simplicity
  • an API that supports user-defined and evolving structures
  • custom streams/tabs à la TweetDeck, but with semantic filtering (e.g. "This month's working hours")
  • URL expander for bit.ly etc.
  • rules to create/infer/extract information from (machine) tags and existing data, maybe recursively
  • Twitter/Identi.ca tracking/relaying

Approach

  • Getting Real (UI first etc., worked great last time)
  • RDF 'n' SPARQL FTW: I don't know what the final data model is going to be, and I want an API but don't have time to code it.

Related Work

Knowee - (The beginning of) a semantic social web address book

Knowee is a web address book that lets you integrate distributed social graph fragments. A new version is online at knowee.net.
Heh, this was planned as a one-week hack but somehow turned into a full re-write that took all of December. Yesterday, I finally managed to tame the semantic bot army and today I've added a basic RDF editor. A sponsored version is now online at knowee.net; a code bundle for self-hosting will be made available at knowee.org tomorrow.

What is Knowee?

Knowee started as a SWEO project. Given the insane number of online social networks we all joined, together with the increasing amount of machine-readable "social data" sources, we dreamed of a distributed address book, where the owner doesn't have to manually maintain contact data, but instead simply subscribes to remote sources. The address book could then update itself automatically. And, in full SemWeb spirit, you'd get access to your consolidated social graph for re-purposing. There are several open-source projects in this area, most notably NoseRub and DiSo. Knowee is aiming at interoperability with these solutions.
knowee concept

Ingredients

For a webby address book, we need to pick some data formats, vocabularies, data exchange mechanisms, and the general app infrastructure:
  • PHP + MySQL: Knowee is based on the ubiquitous LAMP stack. It tries to keep things simple; you don't need system-level access for third-party components or cron jobs.
  • RDF: Knowee utilizes the Resource Description Framework. RDF gives us a very simple model (triples), lots of different formats (JSON, HTML, XML, ...), and free, low-cost extensibility.
  • FOAF, OpenSocial, microformats, Feeds: FOAF is the leading RDF vocabulary for social information. Feeds (RSS, Atom) are the lowest common denominator for exchanging non-static information. OpenSocial and microformats are more than just schemas, but the respective communities maintain very handy term sets, too. Knowee uses equivalent representations in RDF.
  • SPARQL: SPARQL is the W3C-recommended query language and API for the Semantic Web.
  • OpenID: OpenID addresses Identity and Authentication requirements.
I'm still working on a solution for access control, the current Knowee version is limited to public data and simple, password-based access restrictions. OAuth is surely worth a look, although Knowee's use case is a little different and may be fine with just OpenID + sessions. Another option could be the impressive FOAF+SSL proposal, I'm not sure if they'll manage to provide a pure-PHP implementation for non-SSL-enabled hosts, though.

Features / Getting Started

This is a quick walk-through to introduce the current version.
Login / Signup
Log in with your (ideally non-XRDS) OpenID and pick a user name.

knowee login

Account setup
Knowee only supports a few services so far. Adding new ones is not hard, though. You can enable the SG API to auto-discover additional accounts. Hit "Proceed" when you're done.

knowee accounts

Profile setup
You can specify whether to make (parts of) your consolidated profile public or not. During the initial setup process, this screen will be almost empty; you can check back later when the semantic bots have done their job. Hit "Proceed".

knowee profile

Dashboard
The Dashboard shows your personal activity stream (later versions may include your contacts' activities, too), system information and a couple of shortcuts.
knowee dashboard

Contacts
The contact editor is still a work in progress. So far, you can filter the list, add new entries, and edit existing contacts. The RDF editor is still pretty basic (changes will be saved to a separate RDF graph, but deleted/changed fields may re-appear after synchronization; this needs more work). The editor is schema-based and supports the vocabularies mentioned above. You'll be able to create your own fields at some later stage.

It's already possible to import FOAF profiles. Knowee will try to consolidate imported contacts so that you can add data from multiple sources, but then edit the information via a single form. The bot processor is extensible, so we'll be able to add additional consolidators at run-time; it only looks at "owl:sameAs" at the moment.
knowee contacts

Enabling the SPARQL API
In the "Settings" section you'll find a form that lets you activate a personal SPARQL API. You can enable/protect read and/or write operations. The SPARQL endpoint provides low-level access to all your data, allows you to explore your social graph, or lets you create backups of your activity stream.

knowee api

That's more or less it for this version. You can always reset or delete your account, and manually delete incorrectly monitored graphs. The knowee.net system is running on the GoGrid cloud, but I'm still tuning things to let the underlying RDF CMS make better use of the multi-server setup. If things go wrong, blame me, not them. Caching is not fully in place yet, and I've limited the installation to 100 accounts. Give it a try, I'd be happy about feedback.

OpenSocial in RDF

I've created an RDF converter for the OpenSocial field definitions.
I'm currently working on a new release of Knowee. This is another (long-promised) item on my ToDo list before I can finally concentrate on paggr (although it took too long already and hopefully won't break my neck. All the planned paid projects for bootstrapping paggr didn't happen, due to frozen budgets and politics. I hope the situation here improves soon.)

So, while I was trawling the vocabulary market, trying to gather terms for the stuff that Knowee works with (people, their profiles, contacts, accounts, and activities), I remembered OpenSocial, the effort to standardize basic interactions between social networking sites. I can use a good amount of FOAF, but OpenSocial has very handy things such as a generic "tags" field and a clean vCard mapping. And it's a superset of Portable Contacts, too.

Today, I wrote a converter that extracts the field definitions from the JavaScript specification files, together with their labels, comments, domains, and value types. (A little too late, I found out that Dan Brickley had already done part of this a couple of months ago, which could have saved me some work, d'oh.)

I've just added the osoc spec to web-semantics.org/ns. I hope it might be of use to others as well. Funnily enough, the "relationship" term was not part of any of the source files; maybe I'll still have to invent a property (a foaf:knows equivalent that also works with organizations).

poshRDF - RDF extraction from microformats and ad-hoc markup

poshRDF is a new attempt to extract RDF from microformats and ad-hoc markup
I've been thinking about this since Semantic Camp where I had an inspiring dialogue with Keith Alexander about semantics in HTML. We were wondering about the feasibility of a true microformats superset, where existing microformats could be converted to RDF without the need to write a dedicated extractor for each format. This was also about the time when "scoping" and context issues around certain microformats started to be discussed (What happens for example with other people's XFN markup, aggregated in a widget on my homepage? Does it affect my social graph as seen by XFN crawlers? Can I reuse existing class names for new formats, or do we confuse parsers and authors then? Stuff like that).

A couple of days ago I finally wrote up this "poshRDF" idea on the ESW wiki and started with an implementation for paggr widgets, which are meant to expose machine-readable data from RDFa, microformats, but also from user-defined, ad-hoc formats, in an efficient way. PoshRDF can enable single-pass RDF extraction for a set of formats. Previously, my code had to walk through the DOM multiple times, once for each format.

A poshRDF parser is going to be part of one of the next ARC revisions. I've just put up a site at poshrdf.org to host the dynamic posh namespace. For now the site links to a possibly interesting by-product: a unified RDF/OWL schema for the most popular microformats: xfn, rel-tag, rel-bookmark, rel-nofollow, rel-directory, rel-license, hcard, hcalendar, hatom, hreview, xfolk, hresume, address, and geolocation. It's not 100% correct; poshRDF is, after all, still a generic mechanism and doesn't cover format-specific interpretations. But it might be interesting for implementors. The schema could be used to generate dedicated parser configurations. It also describes the typical context of class names so that you can work around scoping issues (e.g. the XFN relations are usually scoped to the document or embedded hAtom entries).
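
The core extraction mechanism can be sketched in a few lines of PHP (hypothetical code, not the actual ARC implementation; the posh namespace URI is an assumption):

  /* one DOM pass: any class name becomes a posh: term (a real parser
     would check a whitelist of known formats and handle scoping) */
  function extractPoshTriples(DOMNode $node, $subject, &$triples) {
    foreach ($node->childNodes as $child) {
      if (!($child instanceof DOMElement)) continue;
      foreach (preg_split('/\s+/', $child->getAttribute('class')) as $name) {
        if ($name === '') continue;
        $triples[] = array(
          's' => $subject,
          'p' => 'http://poshrdf.org/ns/posh#' . $name, /* assumed namespace */
          'o' => trim($child->textContent),
        );
      }
      extractPoshTriples($child, $subject, $triples); /* recurse, same pass */
    }
  }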

I hope to find some time to build a JSON exporter and microformats validator on top of poshRDF in the not too distant future. Got to move on for now, though. Dear Lazyweb, feel free to jump in ;)

paggr teaser video and pre-registration site online

paggr teaser video and landing page
I've been semi-silently working on something new. A combination of many semwebby things I came across and played with during the last 3 years or so:
  • semantic markup
  • smart data
  • an rdf clipboard
  • ajax
  • sparql sparql sparql
  • sparql + scripting
  • sparql + templates
  • sparql + widgets
  • lightweight, federated semweb services and bots
  • UIs for open data
  • semwikis
  • agile and collaborative web development

So, what happens when you put this all together? At least something interesting, and perhaps semsol's first commercial service. (Or product; this is all just LAMP stuff and can easily be run in an intranet or on a hosted server.) Anyway, still some way to go. It's called paggr, the landing page is up, and today I created a first teaser/intro video.

I'll demo the beta (launch planned for November) at the upcoming ISWC during the poster session (my poster is about SPARQL+ and SPARQLScript, the two SPARQL extensions that paggr is based on). I may have early invites by then.

As a preparation for the hopefully busy fall and winter months, though, I'll be on vacation for the next two weeks. No Email, no Web, no Phone. Yay!



HQ version (quicktime, 130MB)
