There was a lot of discussion on the WHATWG mailing list last week about the role and utility of RDFa, whether it’s something that should be supported in HTML5, and what that support should look like.
The objections to adding RDFa to HTML5 seem to fall into three general categories, which I’d paraphrase as:
- It’s not useful or practical to add (meta)data to HTML pages. The vast majority of people writing web pages won’t add it; it won’t be accurate (either unintentionally or maliciously); and anyway, it should be served in a separate file.
- CURIEs are offensive. They use namespace declarations, which are the spawn of the devil; even if they used another syntax to do prefix binding, prefix binding itself is bad design; URIs could be used without being shortened; and anyway, there’s no need to name properties with URIs.
- Adding lots of new attributes to HTML5 is a bad idea; (meta)data should be layered on top of existing markup, like CSS; RDFa markup is ugly.
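For readers who haven’t met CURIEs, here’s a minimal sketch of what the second objection is about. The `dc` prefix and `title` property are standard Dublin Core; the fragment itself is hypothetical:

```html
<!-- RDFa 1.0: the xmlns: declaration binds the prefix "dc", and the
     CURIE dc:title then abbreviates the full property URI
     http://purl.org/dc/elements/1.1/title -->
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">The London Gazette</h1>
</div>
```

It’s exactly this `xmlns:`-based prefix binding that draws fire from the WHATWG side, quite apart from the question of whether properties should be named with URIs at all.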
Ben Adida, Manu Sporny of Digital Bazaar, Dan Brickley and others have been doing a good job explaining the rationale behind the design of RDF and RDFa to WHATWG. Unfortunately, I have the distinct impression that the two communities have such divergent requirements that there’s unlikely to be a meeting, or even opening, of minds any time soon.
Part of the problem, perhaps, is that a lot of work on the semantic web can seem quite academic, theoretical or just plain science fiction. Yeah, it’d be great if I had an artificial agent that would re-book my dentist appointment after booking my hotel and flight for an overseas trip, but this isn’t going to happen any time soon. In the meantime, projects that provide visualisations of RDF graphs or searches over RDF data are clever and technically interesting, but don’t seem to have much purpose for everyday web users.
What’s opened my eyes about the real-world possibilities of the web of data is a long-term project I’m engaged in through TSO and for OPSI (the part of the UK Government that is responsible for public sector information) to add RDFa to the London Gazette. (For details, see the XTech 2008 paper and OpenTech 2008 presentation.) This project is likely to be just the start of the UK Government’s use of RDFa, and here’s why.
The UK Government holds lots of interesting public sector information (PSI) and wants to make that information easily available for others to reuse in their own applications. To give an idea of what they want to do, check out the ideas submitted to the Show Us A Better Way competition run by the UK’s Power of Information Task Force. The competition attracted hundreds of entries describing what could be done with various kinds of public sector information if it were available. (I don’t know if this is more indicative of the hunger for the information, the inventiveness of Web 2.0 geeks, or the motivational power of £20,000; probably a bit of each.)
Now, sometimes public sector information simply isn’t published on the web at all. Sometimes it is made available, but in PDFs. But often it’s simply obfuscated by being embedded within complex, messy web pages.
The state-of-the-art method of getting hold of this information is screen-scraping. The most famous example in the UK PSI world is the Parliament Parser, which takes various publications from the Houses of Parliament and processes them into XML, which is then re-presented on TheyWorkForYou.com and analysed on The Public Whip. But screen-scraping is both difficult and hazardous, because it usually relies on consistencies within a web page that are unintentional on the part of the author and may disappear overnight when the site is redesigned.
So why not publish the data in separate files: in XML, in JSON, in anything that would make it accessible? Well, UK public sector information is distributed around local, regional and national websites, in pages published by councils, agencies, and departments. Like the larger web, these websites are generated in diverse ways:
- by individual webmasters hand-editing pages
- by content management systems with hand-authored content
- by automated systems generating pages programmatically
However it’s produced, there can be barriers to putting up a separate file containing the data:
- the system might not be capable of publishing anything other than HTML
- the author might not have permissions to publish anything other than HTML
- the author might not know how to publish anything other than HTML
- having separate versions of the same data might lead to maintenance problems
- the author might not have the time to produce separate versions of the data
Pragmatically, it’s just not feasible to expect people to be able to publish the data in separate files when they’re already publishing it in HTML. So the challenge is getting the data exposed within the HTML in a way that applications can use it.
Note that I’m not talking here about browsers using the data, although that is an important use case for RDFa. RDFa shares with microformats the ability to mark up items on the page in a way that browsers could use to highlight a phrase and provide a suitable pop-up menu, for example. Or that would enable someone to select a portion of the page and pass it to Ubiquity for further processing.
I’m also not talking about search engines using (meta)data to give supplemental information to searches, in the way that Yahoo! Search does, or to provide focused searches as with the Google Social Graph API. Perhaps, as metadata becomes more prevalent, it will become a more useful source of information. Naturally, search engines will need to learn how to distinguish good metadata from bad metadata to avoid being gamed or reporting inaccurate information, but this is nothing they haven’t done before.
What I am talking about is the ability for ordinary HTML pages to become sources of retrievable data. And the fact that RDFa (and microformats) allows the data to be embedded in the content of the page is significant because it enables authors who don’t have access to the whole page (in particular the head of the page) to nevertheless add (meta)data to their page.
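To make that concrete, here’s a hypothetical fragment showing how an author working only within the body of a page can attach statements to their own content, without touching the head. The `#notice-123` identifier is invented; `foaf:name` is a standard FOAF property:

```html
<!-- All of the (meta)data lives in attributes on body markup the author
     already controls; nothing needs to be added to the page head -->
<p about="#notice-123" xmlns:foaf="http://xmlns.com/foaf/0.1/">
  Published by <span property="foaf:name">The Stationery Office</span>.
</p>
```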
One of Ian Hickson’s mails discussed how authors are too lazy to add (meta)data to pages and too evil for that data to be trusted, so the only solution for applications to understand the information in a web page is natural language processing. I think there’s some truth to that, particularly in the general case, but there are cases where authors can be motivated, can be trusted (enough), and can be assisted by something less powerful than an artificial intelligence. In the case of the London Gazette:
- The authors are being paid to add semantics to the web pages; it’s their job. They don’t have the option of being lazy.
- The statements that we’re making within these pages can’t be added by just anyone. The data has a highly trustworthy provenance: the London Gazette is an official publication of the UK Government, and the information it holds has legal weight.
- Some of the data is collected through forms in the first place, and is thus moderately structured. We want to use RDFa simply to expose the structures we already know about.
- We’re also looking at using relatively simple regular-expression and gazetteer-based processing to identify relevant structures in the unstructured notices that we’re dealing with. Because the notices use a limited range of natural language, we can do this pretty effectively. What we want is to do this processing once, on the server side, rather than having every client that accesses the pages do its own processing. So we want to expose the semantics that we’ve identified through natural language processing by embedding that information within the HTML page.
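As a sketch of what those last two points amount to in markup, suppose our server-side processing has spotted a hearing date in a notice. The `gaz:` vocabulary URI and property name below are invented for illustration (not the ones the Gazette actually uses), but the RDFa 1.0 mechanics are standard:

```html
<!-- A hypothetical Gazette-style notice fragment: the content attribute
     carries the machine-readable ISO 8601 form of the date that the
     human-readable text expresses -->
<p xmlns:gaz="http://example.org/gazette-ontology#"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  The hearing will take place on
  <span property="gaz:hearingDate" datatype="xsd:date"
        content="2008-09-15">15 September 2008</span>
  at the Royal Courts of Justice.
</p>
```

The point is that any client retrieving the page gets the typed, machine-readable date for free, rather than each of them re-running its own fragile extraction over the prose.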
So for us, RDFa is a really useful tool. Without it, or something like it, we won’t be able to use HTML5.
Of course I haven’t addressed whether the way RDFa embeds data is the right design for HTML5; hopefully I’ll have time to talk about that in a later post.