finally a bnode with a uri

Facilitating Resource Description Discovery with Apache and PHP

Implementing URIQA via htaccess files.
Includes a hack, but I thought it was still worth blogging about.

While working on the new semanticweb.org site, I'm facing a lot of issues that haven't been "officially" resolved yet (e.g. by a SWBPD WG recommendation). One of these problems is Resource Description Discovery (RDD).

There are client- and server-side approaches for trying/enabling automatic RDD. I think most people agree that basic scutters (RDF crawlers) should send appropriate Accept headers for RDF/XML data and be able to follow redirects. Advanced RDF clients may try to extract (links to) RDF descriptions embedded in returned HTML pages, images, or XMP-enhanced files. More efficient approaches include doing an HTTP HEAD for a given resource URI first, followed by looking for specific headers in the response. Unfortunately, there is no agreed-on header for such information. X-Metadata-Location and Metadata-Location are among the suggested header names. The most efficient proposal I've seen so far is URIQA's MGET HTTP method, which either returns a description of the resource denoted by the URI sent, or returns a 501 error in case the server didn't understand the request (I don't know how 404s are handled, though).

So, as a scutter developer, I'd probably try to come up with a program that would
  1. do an MGET on a given URI
  2. if that doesn't return RDF, do a HEAD and try to find pointers or a redirect header to an RDF description. (A standard header that indicates the URL of a SPARQL interface would be handy, too, btw)
  3. if still clueless, it'd try a full GET, detect the returned MIME type and proceed according to that (e.g. find a <link rel="meta" ../> tag in an html doc etc.)
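
The three steps above can be sketched in PHP roughly as follows. This is only an illustration, not an existing scutter: the `$http` transport callable, the response array layout, and the function names are assumptions made up for this example, and the `<link>` regex is deliberately naive (it expects `rel` before `href` and does not resolve relative URLs).

```php
<?php
/*
 * Hypothetical transport: function(string $method, string $uri, array $headers): array
 * returning ['status' => int, 'headers' => [lower-cased name => value], 'body' => string].
 */

/* true if the response is a 200 carrying RDF/XML */
function isRdf(array $response): bool
{
    $type = $response['headers']['content-type'] ?? '';
    return $response['status'] === 200 && stripos($type, 'application/rdf+xml') === 0;
}

function discoverDescription(string $uri, callable $http): ?array
{
    $accept = ['Accept' => 'application/rdf+xml'];

    /* step 1: URIQA MGET; a 501 simply falls through to step 2 */
    $res = $http('MGET', $uri, $accept);
    if (isRdf($res)) {
        return ['via' => 'MGET', 'uri' => $uri, 'body' => $res['body']];
    }

    /* step 2: HEAD, then follow a pointer header or a redirect */
    $res = $http('HEAD', $uri, $accept);
    foreach (['metadata-location', 'x-metadata-location', 'location'] as $name) {
        if (!empty($res['headers'][$name])) {
            $target = $res['headers'][$name];
            $rdf = $http('GET', $target, $accept);
            if (isRdf($rdf)) {
                return ['via' => 'HEAD', 'uri' => $target, 'body' => $rdf['body']];
            }
        }
    }

    /* step 3: full GET; take RDF directly or follow <link rel="meta"> in HTML */
    $res = $http('GET', $uri, $accept);
    if (isRdf($res)) {
        return ['via' => 'GET', 'uri' => $uri, 'body' => $res['body']];
    }
    $pattern = '/<link\b[^>]*\brel=["\']meta["\'][^>]*\bhref=["\']([^"\']+)["\']/i';
    if (preg_match($pattern, $res['body'], $m)) {
        $rdf = $http('GET', $m[1], $accept);
        if (isRdf($rdf)) {
            return ['via' => 'link', 'uri' => $m[1], 'body' => $rdf['body']];
        }
    }
    return null; /* still clueless */
}
```

Passing the transport in as a callable keeps the discovery order separate from the actual HTTP library, so the logic can be tried out with canned responses before pointing it at a real server.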

However, implementing the server part is a little bit more complicated. Ignoring the hash vs. slash controversy for a moment, let's assume we have a URI-generation mechanism working for resources whose descriptions we are going to publish on our server. (I'll probably go for something along the lines of http://www.semanticweb.org/r/{resourceID}{optional #-suffix}).

HTTP GETting one of these URIs could return a redirect header to either an HTML representation or an RDF description (depending on Accept headers) of the resource. By using distinct URIs for each representation, I should be able to avoid the problem of URI overloading. As I'm using rewrite rules for serving Web content, I'll be able to provide headers for media items such as images or PDFs as well. HTML pages will contain additional <link/> tags as a human-friendly ("view-source") way to find the RDF data. There are also scutters and tools that look for a <link/> tag only, so it definitely makes sense to include it.
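
To make the Accept-based redirect a bit more concrete, here is a minimal sketch of how the choice between the two representations might be made. The `/data/` and `/page/` URI layouts and both function names are assumptions for illustration only (the post only settles on `/r/{resourceID}`), and wildcards such as `*/*` are ignored, so anything that doesn't explicitly ask for RDF/XML gets HTML.

```php
<?php
/*
 * Returns 'rdf' only if application/rdf+xml is listed with a q-value
 * above 0 and at least as high as text/html; otherwise 'html'.
 */
function preferredRepresentation(string $accept): string
{
    $q = ['application/rdf+xml' => 0.0, 'text/html' => 0.0];
    foreach (explode(',', $accept) as $part) {
        $params = array_map('trim', explode(';', $part));
        $type = strtolower(array_shift($params));
        if (!array_key_exists($type, $q)) {
            continue; /* not one of the two types we serve */
        }
        $weight = 1.0; /* default q-value when none is given */
        foreach ($params as $p) {
            if (stripos($p, 'q=') === 0) {
                $weight = (float) substr($p, 2);
            }
        }
        $q[$type] = $weight;
    }
    $rdf = $q['application/rdf+xml'];
    return ($rdf > 0 && $rdf >= $q['text/html']) ? 'rdf' : 'html';
}

/* map a resource ID to the URI of the chosen representation (hypothetical layout) */
function representationUri(string $resourceId, string $accept): string
{
    $base = 'http://www.semanticweb.org';
    return preferredRepresentation($accept) === 'rdf'
        ? "$base/data/$resourceId.rdf"
        : "$base/page/$resourceId.html";
}
```

The script behind the rewrite rule could then send something like `header('Location: ' . representationUri($id, $_SERVER['HTTP_ACCEPT'] ?? ''), true, 303);`, where `$id` is whatever resource ID was extracted from the request and the choice of redirect status code is up to the site.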

So the only thing that doesn't seem to be implementable is the server part of URIQA. Apache supports only the standard HTTP verbs such as HEAD, GET, POST, or PUT. Adding MGET to that list would require a mod_uriqa extension. And even then there wouldn't be a handy way to catch those MGETs from PHP (the scripting language I'm using): MGET is meant to return unambiguous results only, so it wouldn't be good to auto-forward MGET requests to the PHP processor. An ignorant script would return the same result independent of the request method.

Today I was fine-tuning my custom "catch 404" script a little bit when the penny dropped. Shouldn't it be possible to catch a 501 and then try to detect if a client sent a URIQA request? It's not the cleanest solution, but as I'm using htaccess files for 404s anyway, I thought I should give it a try. So here is my hack for a PHP/htaccess-based URIQA server implementation:

First, we need an htaccess file to handle 501 errors:
ErrorDocument 501 /catch_501.php
Then, we need a PHP file that processes these requests. The nice thing is that Apache automatically sets a REDIRECT_REQUEST_METHOD environment variable, which can be read from PHP:
<?php

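/* detect MGET (the original method is only set when Apache redirected to this script) */
$method=isset($_SERVER["REDIRECT_REQUEST_METHOD"]) ? $_SERVER["REDIRECT_REQUEST_METHOD"] : "";
if($method=="MGET"){
  /* use URIQA-uri, if available, otherwise use request URI */
  if(!empty($_SERVER["HTTP_URIQA_URI"])){
    $uri=$_SERVER["HTTP_URIQA_URI"];
  }
  else{
    $uri="http://".$_SERVER["SERVER_NAME"].$_SERVER["REQUEST_URI"];
  }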
  /* check and confirm that we can handle that URI */
  ...
  /* replace the 501 status that Apache would otherwise send */
  header("HTTP/1.0 200 OK");
  /* check incoming accept headers */
  ...
  /* retrieve resource description from local triple store */
  ...
  /* return result */
  ...
}

?>

That's it. And this mechanism (sometimes referred to as poor man's rewriting when used with 404s) can even be implemented on any average hosted web server.
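
To try the hack out, a client has to send an MGET, which most HTTP helpers don't offer out of the box. One way is to write the request to a socket by hand; the sketch below does that. The function names are made up for this example, it only handles plain http:// URIs, and whether a given server really answers an unknown method with a 501 (and therefore triggers the ErrorDocument) depends on its configuration.

```php
<?php
/* build a raw HTTP/1.0 MGET request for a URI (illustrative helper) */
function buildMgetRequest(string $uri): string
{
    $parts = parse_url($uri);
    $path = ($parts['path'] ?? '/') . (isset($parts['query']) ? '?' . $parts['query'] : '');
    return "MGET $path HTTP/1.0\r\n"
        . "Host: {$parts['host']}\r\n"
        . "Accept: application/rdf+xml\r\n"
        . "Connection: close\r\n\r\n";
}

/* send the request and return the raw response (status line, headers, body) */
function sendMget(string $uri, int $timeout = 5): string
{
    $parts = parse_url($uri);
    $sock = fsockopen($parts['host'], $parts['port'] ?? 80, $errno, $errstr, $timeout);
    if ($sock === false) {
        throw new RuntimeException("connect failed: $errstr");
    }
    fwrite($sock, buildMgetRequest($uri));
    $response = stream_get_contents($sock);
    fclose($sock);
    return $response;
}
```

Calling `sendMget()` with one of the resource URIs and looking at the first line of the result shows whether the catch_501.php script answered with a 200 or the server fell back to its plain 501.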
