
Posts tagged with: crunchbase

CrunchBase Interview

I've been interviewed by the CrunchBase team.
Semantic CrunchBase seems to be worth the time I'm putting into it. Thanks to TechCrunch's and CrunchBase's great move to open their data and encourage reuse (and to write about the apps that use their API), I've had the chance to do a couple of SemWeb demos and reach out to an audience that could benefit as much (or maybe even more) from RDF & Co. as the groups we already have on board: Web app developers.

I also got an offer to write some related articles for DevX, and the CrunchBase team just published an interview where I (shamelessly) promote SemWeb development. I am already noticing an increased number of mails asking for RDF introductions, and people are even starting to just figure things out on their own, with friendly SPARQL paving the path.

This might be the right time for a SWEO II (with a focus on the "E") or a similar effort driven by the RDF community.

How to use CBbot

Some simple instructions for the CrunchBase Twitter bot
OK, looks like the CrunchBase bot got some attention after Fred Wilson's post about possibly handy phone apps. So, if you discovered Semantic CrunchBase or the related bot via non-SemWebby paths, the whole "Define your own API commands with SPARQL" is probably a bit too much, tech-wise. Here are some short instructions for using the bot:

The syntax is basically just "@cbbot, command", where the command has to match one of the user-defined commands. Some of them generate HTML and are therefore not really suited for Twitter access. The main, Twitter-optimized commands are:

  • ${role} (of|at) ${company}: This command is for requests like "@cbbot, Intern at TechCrunch", "@cbbot, founder of facebook", "@cbbot, board of Pandora", "@cbbot, ceo of twitter", etc. (I made the command case-insensitive today, BTW)
  • link to ${keyword}: This command returns a CrunchBase link for the given keyword. The latter can be a company (name or CB identifier), product (name or CB identifier), or person (CB identifier). Examples: "@cbbot, Link to foodzie", "@cbbot, link to EC2", "@cbbot, link to michael-arrington".

Would you like to see additional Twitter commands, but don't know SPARQL or how to use the command editor? Please send command requests to me or to the bot and I'll try to add them.

CrunchBase Twitter Bot

Semantic CrunchBase features a bot that replies to "Pimp My API" commands via Twitter
Update (2008-08-25): I've written a follow-up post explaining the main commands.

Heh, as John Crow points out, a cool way to look at work is to think of it as being "between vacations". So here is a fun/experimental hack from between my previous (bday on Santorini) and next (family visit at Lake Constance) vacation: a tweetsphere cousin of SPARQLBot that can answer user-defined CrunchBase API commands. (Instructions for the "Pimp My API" tool.)

In order to use the bot, just send a known command call to cbbot (using the @-convention), for example "@cbbot, founder of Flickr". A tweet with the answer should appear on your "Replies" tab (or under "Recent", if you are following cbbot, see screenshot below).

CrunchBase Twitter Bot

The bot is implemented as a long-running PHP process (*cough*), so if you don't get an answer within a few minutes, the script may need a restart.

Pimp My (CrunchBase) API

Define your own CrunchBase API commands with SPARQL
In the Semantic CrunchBase announcement, we saw how SPARQL can be used to retrieve fine-grained information from the CrunchBase graph. This follow-up post explains "Pimp My API", a browser-based tool for creating tailored API calls by combining SPARQL with input parameters and output templating. The command editor consists of three tabs: "Define it", "Test it", and "Activate it".

Step 1: Define a new API command

PIMP MY API (1) In the first field ("Command") you define a (human-readable) command, with input parameters set via the ${parameter_name} notation. In the screenshot on the left, we created "${role} of ${comp_name}", which we are going to use to retrieve persons with a specific role at a given company. The command processor will automatically assign variables for a matching input string: "Editor of TechCrunch", for example, sets ${role} to "Editor" and ${comp_name} to "TechCrunch".
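For illustration, here is how such a command pattern could be compiled into a regular expression with named groups. This is a quick Python sketch of the idea, not the actual Semantic CrunchBase code:

```python
import re

def compile_command(pattern):
    """Turn a command pattern like '${role} of ${comp_name}' into a
    compiled, case-insensitive regex with named groups (a sketch; the
    real command processor may work differently)."""
    regex = re.escape(pattern)
    # re.escape turns '${role}' into '\$\{role\}'; map that to a named group
    regex = re.sub(r'\\\$\\\{(\w+)\\\}', r'(?P<\1>.+?)', regex)
    return re.compile('^' + regex + '$', re.IGNORECASE)

cmd = compile_command('${role} of ${comp_name}')
m = cmd.match('Editor of TechCrunch')
params = m.groupdict()  # {'role': 'Editor', 'comp_name': 'TechCrunch'}
```

The case-insensitive flag mirrors the behavior mentioned in the bot post ("I made the command case-insensitive today, BTW").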

Now on to the second field ("SPARQLScript code"):
SPARQLScript is an experiment to extend SPARQL with scripting language features such as variable assignments, loops, etc. (think Pipes for SemWeb developers). If you are familiar with SPARQL, you will notice only three differences to a standard SPARQL query: in the first line, we set a target SPARQL service for the following script blocks; in the second line, we assign the results from the SELECT query to a variable; and the third difference is the use of placeholders in the query. These placeholders will be filled from matching variables before the query is sent to the target endpoint.

If you don't know SPARQL at all, here is a pseudo-translation of the query: Find resources (?comp) with a cb:name (cb is the CrunchBase namespace used for CB attributes) that equals the input parameter "comp_name", and a relationship (?rel). The relationship should have an attribute ?role which regex-matches the input parameter "role". The relationship should also have a cb:person attribute (linking to ?person). The ?person node should have the cb:first_name and cb:last_name attributes. Those should be returned by the query as "fname" and "lname" variables. The whole result set is then assigned to a variable named "rows" (Hmm, maybe the SPARQL is easier to read than my explanation ;)
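Pieced together from that pseudo-translation, the script might look roughly like the string below. Note that the property names (cb:relationship, cb:title) and the endpoint URL are assumptions on my part, and the placeholder-filling step is just sketched in Python:

```python
import re

# A rough reconstruction of the SPARQLScript described above; the exact
# property names and the endpoint URL are placeholders, not the real ones.
script = """
ENDPOINT <http://example.org/sparql>
$rows = SELECT ?fname ?lname WHERE {
  ?comp cb:name "${comp_name}" ;
        cb:relationship ?rel .
  ?rel cb:title ?role_title ;
       cb:person ?person .
  FILTER(REGEX(?role_title, "${role}", "i"))
  ?person cb:first_name ?fname ;
          cb:last_name ?lname .
}
"""

def fill_placeholders(code, params):
    # replace each ${name} placeholder with the matching input parameter
    return re.sub(r'\$\{(\w+)\}',
                  lambda m: params.get(m.group(1), m.group(0)), code)

filled = fill_placeholders(script, {'role': 'Editor', 'comp_name': 'TechCrunch'})
```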

The third form field lets us define an output template. Each stand-alone section surrounded by quotation marks will fill the output buffer. Thus, looping through the "rows" will create a small name snippet for each row. Again, placeholders will be filled with values from the current script scope.
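The templating step itself is simple; a Python sketch with made-up rows:

```python
# Loop over the query result rows and fill a small per-row template.
# The rows here are invented for illustration.
rows = [
    {'fname': 'Jane', 'lname': 'Doe'},
    {'fname': 'John', 'lname': 'Smith'},
]

def render(template, row):
    out = template
    for key, value in row.items():
        out = out.replace('${%s}' % key, value)
    return out

output = ' '.join(render('${fname} ${lname}', row) for row in rows)
# output: 'Jane Doe John Smith'
```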

Step 2: Test your new Command

PIMP MY API (2) Using the test form, we can see if our command pattern works and if the result is formatted as desired. Should anything go wrong, we can select "Show raw output" to get some debugging information. Please note that even though we are using a browser, simple HTML forms, and a friendly pattern language, the commands are sent to real Web services. A broken local script usually just hurts your own machine; a distributed Semantic Web processor like this one, however, may harm other people's servers, so we should be careful, start small, and improve our script incrementally. In this case, the result is a little ugly, so we could improve the output template and inject commas.

Step 3: HTTP access activation

Our command is now defined and successfully tested, let's turn it into a public API call.
Instead of the sort-of natural language command, the API expects GET or POST arguments.

The example above generates a plain text result, but it's also possible to return markup or other formats. SPARQLScript can access GET variables via ${GET.var_name}; this feature can be used to create different output depending on e.g. a "format" parameter. I'm also working on support for content negotiation, where you'd simply create a "${rows}" template and the SPARQLScript processor would auto-generate an appropriate serialization, including correct HTTP headers.
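Here is a sketch of what such format-dependent output could look like; the parameter values and serializations are assumptions, not the actual implementation:

```python
import json

# Pick an output serialization based on a "format" parameter,
# as described above (a sketch, not the SPARQLScript processor).
def serialize(rows, fmt='txt'):
    if fmt == 'json':
        return json.dumps(rows)
    if fmt == 'html':
        return '<ul>' + ''.join('<li>%s</li>' % r['name'] for r in rows) + '</ul>'
    return '\n'.join(r['name'] for r in rows)  # plain text default

rows = [{'name': 'TechCrunch'}, {'name': 'Twitter'}]
```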

Step 4: Have some fun

You may wonder why the command editor allows the definition of a human-friendly pattern when the API itself just needs the parameters. The patterns allow the implementation of an API call detector: depending on the input stream at a generic service URL, we can auto-detect the right script to run. I've test-implemented a Twitter bot that can reply to messages matching a stored API command on Semantic CrunchBase (inactive during the weekend, as it's not tested enough yet. Stay tuned ;). Here is a teaser screenshot for next week:


Semantic Web by Example: Semantic CrunchBase

CrunchBase is now available as Linked Data, including a SPARQL endpoint and a custom API builder based on SPARQLScript.
Update: Wow, these guys are quick, there is now a full RSS feed for CrunchBoard jobs. I've tweaked the related examples.

This post is a bit late (I've even been TechCrunch'd already), but I wanted to add some features before I fully announce "Semantic CrunchBase", a Linked Data version of CrunchBase, the free directory of technology companies, people, and investors. CrunchBase recently activated an awesome API, with the invitation to build apps on top of it. This seemed like the ideal opportunity to test ARC and Trice, but also to demonstrate some of the things that become possible (or much easier) with SemWeb technology.

Turning CrunchBase into a Linked Dataset

The CB API is based on nicely structured JSON documents which can be retrieved through simple HTTP calls. The data is already interlinked, and each core resource (company, person, product, etc.) has a stable identifier, greatly simplifying the creation of RDF. Ideally, machine-readable representations would be served directly by CrunchBase (maybe using the nicely evolving Rena toolkit). But the SemWeb community has a reputation of scaring away maintainers of potential target apps with complicated terminology and machinery before actually showing convincing benefits, so, at this stage (and given the nice API), it might make more sense to start with a separate site and to present a selection of added values first.

For Semantic CrunchBase, I wrote a largely automated JSON2RDF converter, i.e. the initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.) Keeping most of the attribute names from the source docs (and mainly using just a single namespace) has another advantage besides simplified conversion: CrunchBase API users can more easily experiment with the SPARQL API (see twitter.json and twitter.rdf for a direct comparison).
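To give an idea of what "largely automated" means here: a minimal JSON-to-RDF converter can simply turn every attribute into a triple in a single namespace, keeping the attribute names from the source document. The namespace URI and subject scheme below are placeholders, not the actual converter:

```python
import json

# Placeholder namespace; the real converter uses its own cb: namespace.
CB = 'http://example.org/cb#'

def json2rdf(subject, doc):
    """Flatten a JSON object into (subject, predicate, object) triples,
    reusing the source attribute names as predicates (a sketch)."""
    triples = []
    for key, value in doc.items():
        if isinstance(value, (str, int, float)):
            triples.append((subject, CB + key, value))
        elif isinstance(value, dict):
            # nested objects get a derived identifier
            nested = subject + '/' + key
            triples.append((subject, CB + key, nested))
            triples.extend(json2rdf(nested, value))
    return triples

doc = json.loads('{"name": "Twitter", "founded_year": 2006}')
triples = json2rdf('http://example.org/company/twitter#self', doc)
```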

An important principle in RDF land is the distinction between a resource and a page about that resource (you're very unlikely to hear an RDFer say "URLs are People" ;). This means that we need separate identifiers for, e.g., Twitter and the Twitter description. There are different approaches; I decided to use (fake-)hash URIs, which make embedding machine-readable data directly into the HTML views a bit more intuitive (IMHO):
  • /company/twitter#self denotes the company,
  • GETting the identifier resolves to /company/twitter, which describes the company,
  • direct RDF/XML or RDF/JSON can be retrieved by appending ".rdf" to the document URIs and/or via content negotiation.
This may sound a bit complicated (and for some reason RDFers love to endlessly discuss this stuff), but luckily, many RDF toolkits handle much of the needed functionality transparently.
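The identifier scheme itself is trivial to handle in code; a sketch (with a placeholder domain):

```python
# Stripping the fragment turns a resource URI into the URI of the
# document describing it; ".rdf" addresses the RDF/XML variant.
def doc_uri(resource_uri):
    return resource_uri.split('#')[0]

company = 'http://example.org/company/twitter#self'
page = doc_uri(company)  # the page describing the company
rdf = page + '.rdf'      # machine-readable representation
```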

The instant benefit of having Linked Data views is the possibility to freely explore the complete CrunchBase graph (e.g. from a company to its investors, to their organizations, to their relations, etc.). However, the CrunchBase team has already done a great job; their UI supports this kind of exploration quite nicely, so the RDF infrastructure doesn't really add anything here, functionality-wise. There is one advantage, but it's not an obvious one: an RDF-powered app can be extended at any time, on the data level, without the need for model changes (because no fixed model is specified) and without the need for table tweaks (the DB schema is generic). We could, for example, enhance the data with CrunchBoard jobs, DBpedia information, or profiles retrieved from Google's Social Graph API, without having to change a single script or table. (I switched to RDF as a productivity booster some time ago and never looked back. The whole Semantic CrunchBase site took only a few person-days to build, and most of the time was spent on writing the importer.) But let's skip the backstage benefits for now.

SPARQL - SQL for the Web

Tim Berners-Lee recently said that the success of the Semantic Web should be measured by the "level of unexpected reuse". While the HTML-based viewers support a certain level of serendipitous discovery, they only enable resource-by-resource exploration. It is not possible to spot non-predefined patterns such as "serial co-founders" or "founders of companies recently acquired". As an API provider, it is rather tricky to anticipate all potential use cases. On the CB API mailing list, people are expressing their interest in API methods to retrieve recent investments and acquisitions, or social graph fragments. Those can currently only be coded and added by the API maintainers. Enter SPARQL. SPARQL, the protocol and query language for RDF graphs, provides just this: flexibility for developers, less work for API providers. Semantic CrunchBase has an open SPARQL endpoint, but it's also possible to restrict/control the API while still using an RDF interface internally to easily define and activate new API methods. (During the last months I've been working for Intellidimension; they were using an on-request approach for AJAX front-ends, and setting up new API methods was often just a matter of minutes.)

With SPARQL, it gets easy to retrieve (almost) any piece of information, here is an example query that finds companies that were recently acquired:
SELECT DISTINCT ?permalink ?name ?year ?month ?code WHERE {
    ?comp cb:exit ?exit ;
          cb:name ?name ;
          cb:crunchbase_url ?permalink .

    ?exit cb:term_code ?code ;
          cb:acquired_year ?year ;
          cb:acquired_month ?month .
}
ORDER BY DESC(?year) DESC(?month)
(Query result as HTML)
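Since SPARQL is also a protocol, a query like the one above can be sent to any endpoint as a plain HTTP request; a sketch (the endpoint URL is a placeholder):

```python
from urllib.parse import urlencode

# The SPARQL protocol is plain HTTP: the query travels as a GET (or POST)
# parameter; the "output" parameter is a common (non-standard) extension
# for selecting a result format.
endpoint = 'http://example.org/sparql'
query = 'SELECT ?name WHERE { ?comp cb:name ?name } LIMIT 5'
url = endpoint + '?' + urlencode({'query': query, 'output': 'json'})
# fetching `url` (e.g. with urllib.request) would return the result bindings
```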

Or what about a comparison between acquisitions in California and New York:
SELECT DISTINCT COUNT(?link_ca) as ?CA COUNT(?link_ny) as ?NY WHERE {
    ?comp_ca cb:exit ?exit_ca ;
             cb:crunchbase_url ?link_ca ;
             cb:office ?office_ca .
    ?office_ca cb:state_code "CA" .

    ?comp_ny cb:exit ?exit_ny ;
             cb:crunchbase_url ?link_ny ;
             cb:office ?office_ny .
    ?office_ny cb:state_code "NY" .
}

These are just some simple examples, but they (hopefully) illustrate how RDF and SPARQL can significantly improve Web app development and community support. But hey, there is more.

Semantic Mashups with SPARQLScript

SPARQL has only just become a W3C recommendation, and the team behind it was smart enough to not add too many features (even the COUNT I used above is not part of the core spec). The community is currently experimenting with SPARQL extensions, and one particular thing that I'm personally very interested in is the creation of SPARQL-driven mashups through something called SPARQLScript (full disclosure: I'm the only one playing with it so far, it's not a standard at all). SPARQLScript enables the federation of script block execution across multiple SPARQL endpoints. In other words, you can integrate data from different sources on the fly.

Imagine you are looking for a job in California at a company that is at a specific funding stage. CrunchBase knows everything about companies and investments, and it has structured location data. CrunchBoard, on the other hand, has job descriptions, but only a single field for city and state, and not the filter options to match our needs. This is where Linked Data shines. If we find a way to link from CrunchBoard to CrunchBase, we can use Semantic Web technology to run queries that include both sources. And with SPARQLScript, we can construct and leverage these links. Below is a script that first loads the CrunchBoard feed of current job offers (only the last 15 entries, due to common RSS limitations/practices; the use of e.g. hAtom could allow more data to be pulled in). In a second step, it uses the company name to establish a pattern join between CrunchBoard and CrunchBase, which then allows us to retrieve the list of matching jobs at (at least) stage-A companies with offices in California.
PREFIX cboard: <>
# refresh feed
if (${GET.refresh}) {
  # replaced <> with full feed
  LOAD <>
}
# let's query
$jobs = SELECT DISTINCT ?job_link ?comp_link ?job_title ?comp_name WHERE {
  # source: crunchboard, using full feed now
  GRAPH <> {
    ?job rss:link ?job_link ;
         rss:title ?job_title ;
         cboard:company ?comp_name .
  }
  # source: full graph
  ?comp a cb:Company ;
        cb:name ?comp_name ;
        cb:crunchbase_url ?comp_link ;
        cb:office ?office ;
        cb:funding_round ?round .
  ?office cb:state_code "CA" .
  ?round cb:round_code "a" .
}
(You can test it, this really works.)
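Conceptually, the pattern join above boils down to an equality join on the company name between the two datasets, plus the state and funding-stage filters; here it is spelled out in Python with made-up rows:

```python
# Invented sample data standing in for the CrunchBoard feed and the
# CrunchBase graph; only the join/filter logic is the point here.
jobs = [
    {'job_title': 'Web Developer', 'comp_name': 'Acme'},
    {'job_title': 'Designer', 'comp_name': 'Initech'},
]
companies = [
    {'name': 'Acme', 'state_code': 'CA', 'round_code': 'a'},
    {'name': 'Initech', 'state_code': 'NY', 'round_code': 'a'},
]

matches = [
    (j['job_title'], c['name'])
    for j in jobs
    for c in companies
    if j['comp_name'] == c['name']      # the pattern join on the name
    and c['state_code'] == 'CA'         # office in California
    and c['round_code'] == 'a'          # (at least) stage-A funding
]
# matches: [('Web Developer', 'Acme')]
```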

Now that we are knee-deep in SemWeb geekery anyway, we can also add another layer to all of this and
  • allow parameterized queries so that the preferred state and investment stage can be freely defined,
  • add a browser-based tool for the collaborative creation of custom API calls,
  • add a template mechanism for human-friendly results.

I'll write about this "Pimp My API" app at Semantic CrunchBase in the next post, together with some example API calls that were already created with it.
A lot of fun, more to come.

