finally a bnode with a uri

Posts tagged with: trice

Dynamic Semantic Publishing for any Blog (Part 2: Linked ReadWriteWeb)

A DSP proof of concept using ReadWriteWeb.com data.
The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", where I wondered if it could be applied to basically any weblog.

Over the last few days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have a spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.



In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Trice Bot Console

Here is a quick re-cap of the proposed dynamic semantic publishing process, followed by a detailed description of the individual components:
  • Index and monitor the archives pages, build a registry of post URLs.
  • Load and parse posts into raw structures (title, author, content, ...).
  • Extract named entities from each post's main content section.
  • Build a site-optimized schema (an "ontology") from the data structures generated so far.
  • Align the extracted data structures with the target ontology.
  • Re-purpose the final dataset (widgets, entity hubs, semantic ads, authoring tools).

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.
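A minimal sketch of what such an indexer bot could look like, assuming ARC2's store API, PHP's built-in HTTP and regex functions, and made-up URL patterns and vocabulary terms (the real bot also handles pagination):

<?php
include_once('arc2/ARC2.php'); // path to the ARC2 toolkit (assumed)

$config = array(
  'db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...',
  'store_name' => 'linked_rww',
);
$store = ARC2::getStore($config);
if (!$store->isSetUp()) $store->setUp();

// fetch the archives overview page (placeholder URL)
$html = file_get_contents('http://www.readwriteweb.com/archives/');

// extract all link URLs matching the "YYYY/MM" pattern
preg_match_all('/href="([^"]*\/\d{4}\/\d{2}\/[^"]*)"/', $html, $m);

// save each archives page URL in the ARC store (placeholder vocabulary, SPARQL+ INSERT)
foreach (array_unique($m[1]) as $url) {
  $store->query('INSERT INTO <urn:archives> {
    <' . $url . '> a <http://example.org/vocab#ArchivesPage> ;
                   <http://example.org/vocab#crawled> "0" .
  }');
}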

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

Archives triples via SPARQL

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.
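The monitor could work along these lines, again assuming ARC2, its SPARQL+ update commands, and placeholder vocabulary terms and link patterns:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...', 'store_name' => 'linked_rww');
$store = ARC2::getStore($config);

// pick a not-yet-crawled archives page, fall back to the most recent one
$rows = $store->query('
  SELECT ?page WHERE {
    ?page a <http://example.org/vocab#ArchivesPage> ;
          <http://example.org/vocab#crawled> "0" .
  } LIMIT 1', 'rows');
if (!$rows) {
  $rows = $store->query('
    SELECT ?page WHERE { ?page a <http://example.org/vocab#ArchivesPage> . }
    ORDER BY DESC(?page) LIMIT 1', 'rows'); // "YYYY/MM" URLs sort chronologically
}
if (!$rows) exit('nothing to monitor yet');
$page = $rows[0]['page'];

// extract the post links and add them to the post registry (the link pattern is a placeholder)
$html = file_get_contents($page);
preg_match_all('/href="([^"]*\/archives\/[^"]+\.php)"/', $html, $m);
foreach (array_unique($m[1]) as $post_url) {
  $store->query('INSERT INTO <urn:posts> { <' . $post_url . '> a <http://example.org/vocab#Post> . }');
}

// mark the archives page as crawled
$store->query('DELETE FROM <urn:archives> { <' . $page . '> <http://example.org/vocab#crawled> "0" . }');
$store->query('INSERT INTO <urn:archives> { <' . $page . '> <http://example.org/vocab#crawled> "1" . }');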

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
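The DOM-path idea, sketched with PHP's DOM extension; the XPath expressions below are placeholders and would have to be adjusted to the actual RWW templates:

<?php
// parse a single post page into raw structures (title, author, date, main content)
$post_url = 'http://www.readwriteweb.com/archives/example_post.php'; // placeholder

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($post_url)); // suppress warnings caused by real-world HTML
$xpath = new DOMXPath($doc);

// site-specific DOM paths (placeholders)
$title  = $xpath->evaluate('string(//h1[@class="post-title"])');
$author = $xpath->evaluate('string(//span[@class="author"]/a)');
$date   = $xpath->evaluate('string(//abbr[@class="published"]/@title)');
$body   = $xpath->query('//div[@class="post-body"]')->item(0);

$post = array(
  'url'       => $post_url,
  'title'     => trim($title),
  'author'    => trim($author),
  'published' => $date ? date('c', strtotime($date)) : null,
  'content'   => $body ? $body->textContent : '',
);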

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Raw post structures
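For example, an "other posts by the same author" widget could be driven by a query like the following; the post URI and property names are placeholders for whatever the parser bot actually emits:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...', 'store_name' => 'linked_rww');
$store = ARC2::getStore($config);

$post = 'http://www.readwriteweb.com/archives/example_post.php'; // the currently viewed post (placeholder)
$q = '
  PREFIX rww: <http://example.org/vocab#>
  SELECT DISTINCT ?other ?title WHERE {
    <' . $post . '> rww:author ?author .
    ?other rww:author ?author ;
           rww:title ?title .
    FILTER (?other != <' . $post . '>)
  }
  LIMIT 5
';
foreach ($store->query($q, 'rows') as $row) {
  echo '<a href="' . htmlspecialchars($row['other']) . '">' . htmlspecialchars($row['title']) . '</a><br />';
}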

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.
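A rough sketch of such an enrichment bot; the endpoint URL and request parameters below are placeholders (Zemanta and OpenCalais each have their own REST interfaces and API keys), the point being that any service returning RDF can be fed into ARC2's parser and from there into the store:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...', 'store_name' => 'linked_rww');
$store = ARC2::getStore($config);

$post_url     = 'http://www.readwriteweb.com/archives/example_post.php'; // placeholder
$post_content = '...';  // main content section, as extracted by the parser bot

// send the content to an entity extraction service (placeholder endpoint and parameters)
$params = http_build_query(array(
  'api_key' => 'YOUR_API_KEY',
  'format'  => 'rdfxml',      // ask for an RDF response
  'text'    => strip_tags($post_content),
));
$ctx = stream_context_create(array('http' => array(
  'method'  => 'POST',
  'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
  'content' => $params,
)));
$rdfxml = file_get_contents('http://api.example.com/entity-extraction', false, $ctx);

// parse the returned RDF and add the triples to the post's graph
$parser = ARC2::getRDFParser();
$parser->parse($post_url, $rdfxml);   // base URI + raw data
$store->insert($parser->getTriples(), $post_url);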

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

RWW through Paggr Prospect

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can alternatively do the job, too:

RWW entity types
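Something along these lines would do; the aggregate notation follows ARC2's SPARQL+ (the same COUNT extension used in the CrunchBase examples further down), and GROUP BY support may vary between stores:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...', 'store_name' => 'linked_rww');
$store = ARC2::getStore($config);

// list the entity types in the enhanced dataset, most frequent first
$q = '
  SELECT ?type COUNT(?s) AS ?instances WHERE {
    ?s a ?type .
  }
  GROUP BY ?type
  ORDER BY DESC(?instances)
';
foreach ($store->query($q, 'rows') as $row) {
  echo $row['type'] . ' (' . $row['instances'] . ")\n";
}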

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

RWW ontology

Aligning the data with the target ontology

In this step, we again use a software agent and break things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically, we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data (one of them is sketched after the list):
  • Consolidate author aliases ("richard-macmanus-1 = richard-macmanus-2" etc.).
  • Normalize author tags, Zemanta tags, OpenCalais tags, and OpenCalais "industry terms" to a single "tag" field.
  • Consolidate the various type identifiers into canonical ones.
  • For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.
  • Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
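As an example, the tag normalization could be expressed roughly like this; the property URIs are placeholders, and the query uses ARC2's SPARQL+ "INSERT INTO ... WHERE" form (a SPARQL 1.1 Update processor would phrase this slightly differently):

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'rww', 'db_user' => 'rww', 'db_pwd' => '...', 'store_name' => 'linked_rww');
$store = ARC2::getStore($config);

// normalize the various tag-ish properties to a single "tag" field (property URIs are placeholders)
$tag_props = array(
  'http://example.org/vocab#author_tag',
  'http://example.org/vocab#zemanta_tag',
  'http://example.org/vocab#calais_tag',
  'http://example.org/vocab#calais_industry_term',
);
foreach ($tag_props as $prop) {
  $store->query('
    INSERT INTO <urn:aligned> {
      ?post <http://example.org/vocab#tag> ?tag .
    } WHERE {
      ?post <' . $prop . '> ?tag .
    }
  ');
}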
Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public Linked Data varies, but the cloud is very powerful: each optimization step adds to the network effects, and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is just meant to be a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Dynamic RWW Entity Hub

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the ones described in this post, please get in touch.

Dynamic Semantic Publishing for any Blog (Part 1)

Bringing automated semantic page generation a la BBC to standard web environments.
"Dynamic Semantic Publishing" is a new technical term which was introduced by the BBC's online team a few weeks ago. It describes the idea of utilizing Linked Data technology to automate the aggregation and publication of interrelated content objects. The BBC's World Cup website was the first large mainstream website to use this method. It provides hundreds of automatically generated, topically composed pages for individual football entities (players, teams, groups) and related articles.

Now, the added value of such linked "entity hubs" would clearly be very interesting for other websites and blogs as well. They are multi-dimensional entry points to a site and provide a much better and more user-engaging way to explore content than the usual flat archives pages, which normally don't have dimensions beyond date, tag, and author. Additionally, HTML aggregations with embedded Linked Data identifiers can improve search engine rankings and enable semantic ad placement, both attractive by-products.

Entity hub examples

The architecture used by the BBC is optimized for their internal publishing workflow and thus not necessarily suited for small and medium-scale media outlets. So I've started thinking about a lightweight version of the BBC infrastructure, one that would integrate more easily with typical web server environments and widespread blog engines.

What could a generalized approach to dynamic semantic publishing look like?

We should assume setups where direct access to a blog's database tables is not available. Working with already published posts requires a template detector and custom parsers, but it lowers the entry barrier for blog owners significantly. And content importers can be reused to a large extent when sites are based on standard blog engines such as WordPress or Movable Type.

The graphic below (large version) illustrates a possible, generalized approach to dynamic semantic publishing.
Dynamic Semantic Publishing

Process explanation:
  • Step 1: A blog-specific crawling agent indexes articles linked from central archives pages. The index is stored as RDF, which enables the easy expansion of post URLs to richly annotated content objects.
  • Step 2: Not-yet-imported posts from the generated blog index are parsed into core structural elements such as title, author, date of publication, main content, comments, Tweet counters, Facebook Likes, and so on. The semi-structured post information is added to the triple store for later processing by other agents and scripts. Again, we need site (or blog engine)-specific code to extract the various possible structures. This step could be accelerated by using an interactive extractor builder, though.
  • Step 3: Post contents are passed to APIs like OpenCalais or Zemanta in order to extract stable and re-usable entity identifiers. The resulting data is added to the RDF Store.
  • After the initial semantification in step 3, a generic RDF data browser can be used to explore the extracted information. This simplifies general consistency checks and the identification of the site-specific ontology (concepts and how they are related). Alternatively, this could be done (in a less comfortable way) via the RDF store's SPARQL API.
  • Step 4: Once we have a general idea of the target schema (entity types and their relations), custom SPARQL agents process the data and populate the ontology. They can optionally access and utilize public data.
  • After step 4, the rich resulting graph data allows the creation of context-aware widgets. These widgets ("Related articles", "Authors for this topic", "Product experts", "Top commenters", "Related technologies", etc.) can now be used to build user-facing applications and tools.
  • Use case 1: Entity hubs for things like authors, products, people, organizations, commenters, or other domain-specific concepts.
  • Use case 2: Improving the source blog. The typical "Related articles" sections in standard blog engines, for example, don't take social data such as Facebook Likes or re-tweets into account. Often, they are just based on explicitly defined tags. With the enhanced blog data, we can generate aggregations driven by rich semantic criteria (see the query sketch after this list).
  • Use case 3: Authoring extensions: After all, the automated entity extraction APIs are not perfect. With the site-wide ontology in place, we could provide content creators with convenient annotation tools to manually highlight some text and then associate the selection with a typed entity from the RDF store. Or they could add their own concepts to the ontology and share it with other authors. The manual annotations help increase the quality of the entity hubs and blog widgets.
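A sketch of what such a semantically enriched "related articles" query (use case 2 above) might look like, with placeholder property names, using tag overlap for matching and re-tweet counts for ranking:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'blog', 'db_user' => 'blog', 'db_pwd' => '...', 'store_name' => 'dsp');
$store = ARC2::getStore($config);

$current_post = 'http://example.org/blog/some-post'; // the post currently being viewed (placeholder)

// related articles: share at least one tag with the current post, ranked by re-tweets
$q = '
  PREFIX ex: <http://example.org/vocab#>
  SELECT DISTINCT ?related ?title ?tweets WHERE {
    <' . $current_post . '> ex:tag ?tag .
    ?related ex:tag ?tag ;
             ex:title ?title ;
             ex:tweet_count ?tweets .
    FILTER (?related != <' . $current_post . '>)
  }
  ORDER BY DESC(?tweets)
  LIMIT 5
';
$related_articles = $store->query($q, 'rows');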

Does it work?

I explored this approach to dynamic semantic publishing with nearly nine thousand articles from ReadWriteWeb. In the next post, I'll describe a "Linked RWW" demo which combines Trice bots, ARC, Prospect, and the handy semantic APIs provided by OpenCalais and Zemanta.

Trice's Semantic Richtext Editor

A screencast demonstrating the structured RTE bundled with the Trice CMS
In my previous post I mentioned that I'm building a Linked Data CMS. One of its components is a rich-text editor that allows the creation (and embedding) of structured markup.

An earlier version supported limited Microdata annotations, but now I've switched the mechanism and use an intermediate, but even simpler approach based on HTML5's handy data-* attributes. This lets you build almost arbitrary markup with the editor, including Microformats, Microdata, or RDFa. I don't know yet when the CMS will be publicly available (3 sites are under development right now), but as mentioned, I'd be happy about another pilot project or two. Below is a video demonstrating the editor and its easy customization options.

Could having two RDF-in-HTMLs actually be handy?

A combination of RDFa and Microdata would allow for separate semantic layers.
Apart from grumpy rants about the complexity of W3C's RDF specs and semantic richtext editing excitement, I haven't blogged or tweeted a lot recently. That's partly because there finally is increased demand for the stuff I'm doing at semsol (agency-style SemWeb development), but also because I've been working hard on getting my tools in a state where they feel more like typical Web frameworks and apps. Talis' Fanhu.bz is an example where (I think) we found a good balance between powerful RDF capabilities (data re-purposing, remote models, data augmentation, a crazy army of inference bots) and a non-technical UI (simplistic visual browser, Twitter-based annotation interfaces).

Another example is something I've been working on over the last months: I somehow managed to combine essential parts of Paggr (a drag&drop portal system based on RDF- and SPARQL-based widgets) with an RDF CMS (I'm currently looking for pilot projects). And although I decided to switch entirely to Microdata for semantic markup after exploring it during the FanHubz project, I wonder if there might be room for having two separate semantic layers in this sort of widget-based website. Here is why:

As mentioned, I've taken a widget-like approach for the CMS. Each page section is a resource of its own that can be defined and extended by the web developer, styled by themers, and re-arranged and configured by the webmaster. In the RDF CMS context, widgets can easily integrate remote data, and when the integrated information is exposed as machine-readable data in the front-end, we can get beyond the "just-visual" integration of current widget pages and bring truly connectable and reusable information to the user interface.

Ideally, both the widgets' structural data and the content can be re-purposed by other apps. Just like in the early days of the Web, we could re-introduce a copy & paste culture of things for people to include in their own sites. With the difference that RDF simplifies copy-by-reference and source attribution. And both developers and end-users could be part of the game this time.

Anyway, one technical issue I encountered arises when a page contains multiple page items but describes a single resource. With a single markup layer (say Microdata), you get a single tree where the context of the hierarchy is constantly switching between structural elements and content items (page structure -> main content -> page layout -> widget structure -> widget content). If you want to describe a single resource, you have to repeatedly re-introduce the triple subject ("this is about the page structure", "this is about the main page topic"). The first screenshot below shows the different (grey) widget areas in the editing view of the CMS. In the second screenshot, you can see that the displayed information (the marked calendar date, the flyer image, and the description) in the main area and the sidebar is about a single resource (an event).

Trice CMS Editor
Trice CMS editing view

Trice CMS Editor
Trice CMS page view with inline widgets describing one resource

If I used two separate semantic layers, e.g. RDFa for the content (the event description) and Microdata for the structural elements (column widths, widget template URIs, widget instance URIs), I could describe the resource and the structure without repeating the event subject in each page item.

To be honest, I'm not sure yet if this is really a problem, but I thought writing it down could kick off some thought processes (which now tend towards "No"). Keeping triples as stand-alone-ish as possible may actually be an advantage (even if subject URIs have to be repeated). No semantic markup solution so far provides full containment for reliable copy & paste, but explicit subjects (or "itemid"s in Microdata-speak) could bring us a little closer.

Conclusions? Err.., none yet. But hey, did you see the cool CMS screenshots?

Code.semsol.org - A central home for semsol code

Semsol gets code repositories and browsers
The code bundles on the ARC website are generated in an inefficient manual process, and each patch has to wait for the next to-be-generated zip file. The developer community is growing (there are now 600 ARC downloads each month), I'm increasingly receiving patches and requests for a proper repository, and the Trice framework is about to get online as well. So I spent last week building a dedicated source code site for all semsol projects at code.semsol.org.

So far, it's not much more than a directory browser with source preview and a little method navigator. But it will simplify code sharing and frequent updates for me, and hopefully also for ARC and Trice developers. You can check out various Bazaar code branches and generate a bundle from any directory. The app can't display repository messages yet (the server doesn't have bzr installed; I'm just deploying branches using the handy FTP option), but I'll try to come up with a work-around or an alternative when time permits.

Code Browser

Back from New York "Semantic Web for PHP Developers" trip

Gave a talk and a workshop in NYC about SemWeb technologies for PHP developers
/me at Times Square
I'm back from New York, where I was given the great opportunity to talk about two of my favorite topics: Semantic Web Development with PHP, and (not necessarily semantic) Software Development using RDF Technology. I was especially looking forward to the second one, as that perspective is not only easier to understand for people from a software engineering context, but also a still much-neglected marketing "back-door": If RDF simplifies working with data in general (and it does), then we should not limit its use to semantic web apps. Broader data distribution and integration may naturally follow in a second or third step once people use the technology (so much for my contribution to Michael Hausenblas' list of RDF (Mal)Best Practices ;)

The talk on Thursday at the NY Semantic Web Meetup was great fun. But the most impressive part of the event was the people there. A lot to learn from on this side of the pond. Not only very practical and professional, but also extremely positive and open. It almost felt like being invited to a family party.

The positive attitude was even true for the workshop, which I clearly could have made more effective. I didn't expect (but should have) that many people would come w/o a LAMP stack on their laptops, so we lost a lot of time setting up MAMP/LAMP/WAMP before we started hacking ARC, Trice, and SPARQL.

Marco brought up a number of illustrative use cases. He maintains an (unofficial, sorry, can't provide a pointer) RDF wrapper for any group on meetup.com, so the workshop participants could directly work with real data. We explored overlaps between different Meetup groups, the order in which people joined selected groups, inferred new triples from combined datasets via CONSTRUCT, and played with not-yet-standard SPARQL features like COUNT and LOAD.

And having done the workshop should finally give me the last kick to launch the Trice site now. The code is out, and it's apparently not too tricky to get started even when the documentation is still incomplete. Unfortunately, I have a strict "no more non-profits" directive, but I think Trice, despite being FOSS, will help me get some paid projects, so I'll squeeze an official launch in sometime soon-ish.

Below are the slides from the meetup. I added some screenshots, but they are probably still a bit boring without the actual demos (I think a video will be put up in a couple of days, though).

Knowee - (The beginning of) a semantic social web address book

Knowee is a web address book that lets you integrate distributed social graph fragments. A new version is online at knowee.net.
Heh, this was planned as a one-week hack but somehow turned into a full re-write that took all of December. Yesterday, I finally managed to tame the semantic bot army and today I've added a basic RDF editor. A sponsored version is now online at knowee.net; a code bundle for self-hosting will be made available at knowee.org tomorrow.

What is Knowee?

Knowee started as a SWEO project. Given the insane number of online social networks we all joined, together with the increasing number of machine-readable "social data" sources, we dreamed of a distributed address book, where the owner doesn't have to manually maintain contact data, but instead simply subscribes to remote sources. The address book could then update itself automatically. And, in full SemWeb spirit, you'd get access to your consolidated social graph for re-purposing. There are several open-source projects in this area, most notably NoseRub and DiSo. Knowee is aiming at interoperability with these solutions.
knowee concept

Ingredients

For a webby address book, we need to pick some data formats, vocabularies, data exchange mechanisms, and the general app infrastructure:
  • PHP + MySQL: Knowee is based on the ubiquitous LAMP stack. It tries to keep things simple, you don't need system-level access for third-party components or cron jobs.
  • RDF: Knowee utilizes the Resource Description Framework. RDF gives us a very simple model (triples), lots of different formats (JSON, HTML, XML, ...), and free, low-cost extensibility.
  • FOAF, OpenSocial, microformats, Feeds: FOAF is the leading RDF vocabulary for social information. Feeds (RSS, Atom) are the lowest common denominator for exchanging non-static information. OpenSocial and microformats are more than just schemas, but the respective communities maintain very handy term sets, too. Knowee uses equivalent representations in RDF.
  • SPARQL: SPARQL is the W3C-recommended Query language and API for the Semantic Web.
  • OpenID: OpenID addresses Identity and Authentication requirements.
I'm still working on a solution for access control; the current Knowee version is limited to public data and simple, password-based access restrictions. OAuth is surely worth a look, although Knowee's use case is a little different and may be fine with just OpenID + sessions. Another option could be the impressive FOAF+SSL proposal, though I'm not sure they'll manage to provide a pure-PHP implementation for non-SSL-enabled hosts.

Features / Getting Started

This is a quick walk-through to introduce the current version.
Login / Signup
Log in with your (ideally non-XRDS) OpenID and pick a user name.

knowee login

Account setup
Knowee only supports a few services so far. Adding new ones is not hard, though. You can enable the SG API to auto-discover additional accounts. Hit "Proceed" when you're done.

knowee accounts

Profile setup
You can specify whether to make (parts of) your consolidated profile public or not. During the initial setup process, this screen will be almost empty; you can check back later when the semantic bots have done their job. Hit "Proceed".

knowee profile

Dashboard
The Dashboard shows your personal activity stream (later versions may include your contacts' activities, too), system information and a couple of shortcuts.
knowee dashboard

Contacts
The contact editor is still a work in progress. So far, you can filter the list, add new entries, and edit existing contacts. The RDF editor is still pretty basic (changes will be saved to a separate RDF graph, but deleted/changed fields may re-appear after synchronization; this needs more work). The editor is schema-based and supports the vocabularies mentioned above. You'll be able to create your own fields at some later stage.

It's already possible to import FOAF profiles. Knowee will try to consolidate imported contacts so that you can add data from multiple sources, but then edit the information via a single form. The bot processor is extensible (we'll be able to add additional consolidators at run-time), but it only looks at "owl:sameAs" at the moment.
knowee contacts
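A minimal sketch of what an owl:sameAs-based consolidator bot could do; the graph name and the copy-everything strategy are simplifying assumptions, and the "INSERT ... WHERE" form is ARC2's SPARQL+ dialect:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'knowee', 'db_user' => 'knowee', 'db_pwd' => '...', 'store_name' => 'knowee');
$store = ARC2::getStore($config);

// find contact resources that are declared identical
$rows = $store->query('
  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  SELECT ?canonical ?alias WHERE { ?canonical owl:sameAs ?alias . FILTER (?canonical != ?alias) }
', 'rows');

// copy the alias properties over to the canonical resource
foreach ($rows as $row) {
  $store->query('
    INSERT INTO <urn:consolidated> {
      <' . $row['canonical'] . '> ?p ?o .
    } WHERE {
      <' . $row['alias'] . '> ?p ?o .
    }
  ');
}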

Enabling the SPARQL API
In the "Settings" section you'll find a form that lets you activate a personal SPARQL API. You can enable/protect read and/or write operations. The SPARQL endpoint provides low-level access to all your data, allows you to explore your social graph, or lets you create backups of your activity stream.

knowee api
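Querying such a personal endpoint from another application could look like this, assuming ARC2's remote store component and a placeholder endpoint URL:

<?php
include_once('arc2/ARC2.php');

// point ARC2's remote store at the personal SPARQL endpoint (placeholder URL)
$config = array('remote_store_endpoint' => 'http://knowee.net/sparql/alice');
$store = ARC2::getRemoteStore($config);

// e.g. dump triples for a backup (the exact vocabularies depend on the imported sources)
$q = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 100';
foreach ($store->query($q, 'rows') as $row) {
  echo $row['s'] . ' ' . $row['p'] . ' ' . $row['o'] . "\n";
}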

That's more or less it for this version. You can always reset or delete your account, and manually delete incorrectly monitored graphs. The knowee.net system is running on the GoGrid cloud, but I'm still tuning things to let the underlying RDF CMS make better use of the multi-server setup. If things go wrong, blame me, not them. Caching is not fully in place yet, and I've limited the installation to 100 accounts. Give it a try; I'd be happy about feedback.

dooit - a live Getting Real experiment

I created an RDF app following the Getting Real approach
dooit
I've probably read Getting Real half a dozen times since the release of the free online version last year. The agile process seems to fit quite nicely with RDF-based tools (Semantic CrunchBase was the most recent proof of concept for me). I'm currently writing a DevX article about using RDF and SPARQL in combination with Getting Real and wondered about quantitative numbers for such an approach. As I usually don't record hours for personal projects, I had to create a new one: a to-do list manager, sillily named "dooit".

dooit follows a lot of GR suggestions such as "UI first", not wasting too much time on a name, that less may be enough for 80% of the use cases, or that usage patterns may evolve as "just-as-good" replacements of features ("mm-dd" tags could for example enable calendar-like functionality).

I started the live experiment on Friday and finished the first iteration on Saturday. Below is a twitter log of the individual activities. I was using Trice as a Web framework, otherwise I would of course have spent much more time on generating forms and implementing AJAX handlers etc. So, the numbers only reflect the project-specific effort, but that's what I was interested in.
  • (Fr 08:24) trying the "Getting Real" approach for a small RDF app
  • (Fr 10:51) idea: a siiimple to-do list with taggable items
  • (Fr 11:02) nailing down initial feature set: ~15mins: add, edit, tick off taggable to-do items
  • (Fr 11:02) finding a silly product name: ~5mins: "dooit"
  • (Fr 11:27) creating paper sketches: ~20mins (IIRC, done yesterday evening)
  • (Fr 11:42) got unreal by first spending ~30mins on a logo
  • (Fr 12:07) Setting up blank Trice instance and basic layout to help with HTML creation: ~25mins
  • (Fr 13:52) first dooit HTML mock-up and CSS stylesheet: ~90mins
  • (Fr 17:14) JavaScript/AJAX hooks for editing in place, forms work, too, but w/o data access on the server: ~3h
  • (Fr 18:12) identifying RDF terms for the data structures: ~30min
  • (Fr 18:13) gotta run. time spent so far for creating RDF from a submitted form: 20mins
  • (Sa 14:40) continuing Getting Real live experiment
  • (Sa 14:41) "URIs everywhere" is one of the main issues for agile development of rdf-based apps. Will try to auto-gen them directly from the forms..
  • (Sa 19:04) rdf infrastructure work to auto-generate RDF from forms and to auto-fill forms from RDF: ~2h
  • (Sa 19:07) functions to send form data to RDF store via SPARQL DELETE/INSERT calls: ~1h
  • (Sa 19:09) replacing mockup template sections with SPARQL-generated snippets: ~1h (CRUD and filter-by-tag now in place, just ticking off items doesn't work yet)
  • (Sa 20:09) implementing rest of initial feature set, tests, fine-tuning: ~1 h. done :)
  • (Sa 20:14) Result of Getting Real experiment: http://semsol.org/dooit Got Real in ~10 12 hours
I think I can call it a success so far. One point about GR is staying focused; working from the UI to the code helps a lot here (as does live-logging, I guess ;). But I'm not done yet. Now that I have a first running version, I still have to see if my RDF-driven app can evolve, if the code is manageable and easy to change. I'm looking forward to finding that out, but my shiny new dooit list suggests finishing the DevX article first ;)

Semantic Web by Example: Semantic CrunchBase

CrunchBase is now available as Linked Data including a SPARQL endpoint and a custom API builder based on SPARQLScript.
Update: Wow, these guys are quick; there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served from crunchbase.com directly (maybe using the nicely evolving Rena toolkit), but the SemWeb community has a reputation for scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits, so, at this stage (and given the nice API), it might make more sense to start with a separate site, and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).
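A condensed sketch of the JSON2RDF idea (the real converter also handles nested structures, lists, and linked sub-resources); the API URL reflects the CrunchBase API of that time, and the namespace and helper code are illustrative assumptions:

<?php
include_once('arc2/ARC2.php');
$config = array('db_host' => 'localhost', 'db_name' => 'cb', 'db_user' => 'cb', 'db_pwd' => '...', 'store_name' => 'semantic_cb');
$store = ARC2::getStore($config);

$ns = 'http://cb.semsol.org/ns#';   // single namespace, attribute names kept from the source docs

// fetch a company document from the CrunchBase API and decode it
$json = file_get_contents('http://api.crunchbase.com/v/1/company/twitter.js');
$data = json_decode($json, true);

// generate triples for the scalar attributes (nested structures omitted in this sketch)
$s = 'http://cb.semsol.org/company/twitter#self';   // the company, not the page about it
$triples = array();
foreach ($data as $key => $value) {
  if (is_scalar($value) && $value !== '') {
    $triples[] = array(
      's' => $s, 's_type' => 'uri',
      'p' => $ns . $key,
      'o' => (string) $value, 'o_type' => 'literal', 'o_datatype' => '', 'o_lang' => '',
    );
  }
}
$store->insert($triples, 'http://cb.semsol.org/company/twitter');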

An important principle in RDF land is the distinction between a resource and a page about a resource (it's very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for e.g. Twitter and the Twitter description. There are different approaches; I decided to use (fake-)hash URIs, which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company,
  • GETing the identifier resolves to /company/twitter which describes the company.
  • Direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via Content Negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
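A simplified sketch of the resulting identifier scheme and content negotiation (the actual routing is handled by Trice; this is just generic PHP to illustrate the mechanics):

<?php
// /company/twitter#self  -> the company itself (the fragment never reaches the server)
// /company/twitter       -> HTML description of the company
// /company/twitter.rdf   -> RDF/XML description of the company

$path   = $_SERVER['REQUEST_URI'];
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

if (preg_match('/\.rdf$/', $path)) {
  $format = 'rdfxml';                                      // explicit ".rdf" suffix
}
elseif (strpos($accept, 'application/rdf+xml') !== false) {
  $format = 'rdfxml';                                      // negotiated via the Accept header
}
else {
  $format = 'html';
}

if ($format == 'rdfxml') {
  header('Content-Type: application/rdf+xml');
  // ... serialize the resource description, e.g. with ARC2's RDF/XML serializer
}
else {
  header('Content-Type: text/html; charset=utf-8');
  // ... render the HTML view with embedded machine-readable data
}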

The instant benefit of having Linked Data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors to their organizations to their relations etc.). However, the CrunchBase team has already done a great job; their UI supports this kind of exploration quite nicely, so the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not obvious: an RDF-powered app can be extended at any time. On the data level. Without the need for model changes (because there is no explicit model specified). And without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard Jobs, DBPedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as a productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person-days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders", or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Those can currently only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last few months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends. Setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information; here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC (?year) DESC (?month)
LIMIT 20
(Query result as HTML)

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}
(Results)

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far; it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies, investments, and has structured location data. CrunchBoard, on the other hand, has job descriptions, but only a single field for City and State, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS limitations/practices; the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <http://www.crunchboard.com>
ENDPOINT <http://cb.semsol.org/sparql>
# refresh feed
if (${GET.refresh}) {
 # replaced <http://feeds.feedburner.com/CrunchboardJobs> with full feed
 LOAD <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <http://www.crunchboard.com/rss/affiliate/crunchboardrss_all.xml> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls
  • add a template mechanism for human-friendly results

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post. Here are some example API calls that were already created with it:
A lot of fun, more to come.

Looking for paid (Semantic Web) Projects

I could use more paid projects...
Update 2: Yay, I think I'm safe for the next couple of months; I should have blogged much earlier. Now I'm starting to think we could really use a job site for SemWeb people...

Update: Ah, the blogosphere. I already received some replies. One to share: Aduna is looking for a Java Engineer.

About a year ago, I received some funds which allowed me to re-write the ARC toolkit, and also to bring Trice (a semantic web application framework for PHP) to production-readiness. However, Semantic Web Development is generally still very new, especially in the Web Agency market where I'm coming from. It's not that easy yet to keep things self-sustaining.

It may well be that I should blog less about bleeding-edge experiments and more about how RDF and SPARQL allow me to deploy extensible websites in a fraction of the time it used to take in the past. "Release Early", "Data First", "Evolve on the Fly", and all those patterns that SemWeb technology enables in a web development context.

Anyway, to keep things short: I'm actively (read: urgently ;-) looking for more paid projects. I'm a Web development all-rounder with a particular interest in scripting languages and quite some experience in delivering RDF and frontend solutions (more details on my profile page). While it would of course be great to work on stuff where I can use my tools, I'm available for more general web development as well. I'm most productive when I can work from my office, but occasional travel is basically fine, too. Düsseldorf Airport is just minutes away.

Cheers in advance for suggestions,

DriftR Linked Data Browser and Editor (Screencast)

A screencast of DriftR, an RDF browser/editor for Trice
While I'm unfortunately struggling to find paid projects these days, I had at least some time to work on core technology for my Trice framework and a new knowee release. The latest module is an in-browser RDF viewer and editor for Linked Data, heavily inspired by the freebase UI (hopefully with less screen flickering, though).

I'm clearly not there yet, but today I uploaded a screencast (quicktime 4MB), and I think I can start incorporating it into the knowee tools soon. Have fun watching it if you like, and Merry X-Mas!

DriftR Screencast

Experimental ARC mailing list

A public group mailing list for ARC
ARC RDF Classes for PHP
I'm still working on the new website for ARC, but I managed to set up a group mailing list yesterday. It's a little (*cough*) experimental, based on ARC2 and Trice (another forthcoming semsol product). So, this is a shout-out to ARC users and developers with an invitation to subscribe and help me test that "DIY SPARQL Mailman" before I do a proper announcement for the new site and community tools (hopefully later this week).

Thanks in advance,
Benji
