RDFa and HTML5: UK Government Experience

By Jeni Tennison
September 4, 2008 | Comments: 4

There was a lot of discussion on the WHATWG mailing list last week about the role and utility of RDFa, whether it’s something that should be supported in HTML5, and what that support should look like.

The objections to adding RDFa to HTML5 seem to fall into three general categories, which I’d paraphrase as:

  • It’s not useful or practical to add (meta)data to HTML pages. The vast majority of people writing web pages won’t add it; it won’t be accurate (either unintentionally or maliciously); and anyway, it should be served in a separate file.
  • CURIEs are offensive. They use namespace declarations, which are the spawn of the devil; even if they used another syntax to do prefix binding, prefix binding itself is bad design; URIs could be used without being shortened; and anyway, there’s no need to name properties with URIs.
  • Adding lots of new attributes to HTML5 is a bad idea; (meta)data should be layered on top of existing markup, like CSS; RDFa markup is ugly.

Ben Adida, Manu Sporny of Digital Bazaar, Dan Brickley and others have been doing a good job explaining the rationale behind the design of RDF and RDFa to WHATWG. Unfortunately, I have the distinct impression that the two communities have such divergent requirements that there’s unlikely to be a meeting, or even opening, of minds any time soon.

Part of the problem, perhaps, is that a lot of work on the semantic web can seem quite academic, theoretical or just plain science fiction. Yeah, it’d be great if I had an artificial agent that would re-book my dentist appointment after booking my hotel and flight for an overseas trip, but this isn’t going to happen any time soon. In the meantime, projects that provide visualisations of RDF graphs or searches over RDF data are clever and technically interesting but don’t seem to have much purpose for everyday web users.

What’s opened my eyes about the real-world possibilities of the web of data is a long-term project I’m engaged in through TSO and for OPSI (the part of the UK Government that is responsible for public sector information) to add RDFa to the London Gazette. (For details, see the XTech 2008 paper and OpenTech 2008 presentation.) This project is likely to be just the start of the UK Government’s use of RDFa, and here’s why.

The UK Government holds lots of interesting public sector information (PSI) and wants to make that information easily available for others to reuse in their own applications. To give an idea of what they want to do, check out the ideas submitted to the Show Us A Better Way competition run by the UK’s Power of Information Task Force. The competition attracted hundreds of entries describing what could be done with various kinds of public sector information if it were available. (I don’t know if this is more indicative of the hunger for the information, the inventiveness of Web 2.0 geeks, or the motivational power of £20,000; probably a bit of each.)

Now, sometimes public sector information simply isn’t published on the web at all. Sometimes it is made available, but only in PDFs. But often it’s effectively obfuscated by being embedded within complex, messy web pages.

The state-of-the-art method of getting hold of this information is screen-scraping. The most famous example in the UK PSI world is the Parliament Parser, which takes various publications from the Houses of Parliament and processes them into XML, which is then re-presented on TheyWorkForYou.com and analysed on The Public Whip. But screen-scraping is both difficult and hazardous, because it usually relies on consistencies within a web page that are unintentional on the part of the author and may disappear overnight when the site is redesigned.
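
To make the hazard concrete, here’s a minimal sketch of the kind of extraction a screen-scraper performs; the markup and the regular expressions are invented for illustration, not taken from any real scraper.

    import re

    # Hypothetical council page: the scraper depends on the fee always
    # sitting in the third <td> of each row, a consistency the author
    # never promised to maintain.
    page = """
    <table>
      <tr><td>Skip licence</td><td>Ref 1042</td><td>&pound;27.50</td></tr>
      <tr><td>Scaffold permit</td><td>Ref 1043</td><td>&pound;41.00</td></tr>
    </table>
    """

    for row in re.findall(r"<tr>(.*?)</tr>", page, re.S):
        cells = re.findall(r"<td>(.*?)</td>", row)
        print(cells[0], cells[2])

    # A redesign that reorders the columns, adds an extra <td> for links,
    # or switches the first cell to <th> silently breaks this extraction.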

So why not publish the data in separate files: in XML, in JSON, in anything that would make it accessible? Well, UK public sector information is distributed across local, regional and national websites, in pages published by councils, agencies and departments. Like the larger web, the websites are generated in diverse ways:

  • by individual webmasters hand-editing pages
  • by content management systems with hand-authored content
  • by automated systems generating pages programmatically

However it’s produced, there can be barriers to putting up a separate file containing the data:

  • the system might not be capable of publishing anything other than HTML
  • the author might not have permissions to publish anything other than HTML
  • the author might not know how to publish anything other than HTML
  • having separate versions of the same data might lead to maintenance problems
  • the author might not have the time to produce separate versions of the data

Pragmatically, it’s just not feasible to expect people to be able to publish the data in separate files when they’re already publishing it in HTML. So the challenge is getting the data exposed within the HTML in a way that applications can use it.

Note that I’m not talking here about browsers using the data, although that is an important use case for RDFa. RDFa shares with microformats the ability to mark up items on the page in a way that browsers could use, for example, to highlight a phrase and provide a suitable pop-up menu, or to enable someone to select a portion of the page and pass it to Ubiquity for further processing.

I’m also not talking about search engines using (meta)data to give supplemental information to searches, in the way that Yahoo! Search does, or to provide focused searches, as with the Google Social Graph API. Perhaps, as metadata becomes more prevalent, it will become a more useful source of information. Naturally, search engines will need to learn how to distinguish good metadata from bad metadata to avoid being gamed or reporting inaccurate information, but this is nothing they haven’t done before.

What I am talking about is the ability for ordinary HTML pages to become sources of retrievable data. And the fact that RDFa (like microformats) allows the data to be embedded in the content of the page is significant, because it enables authors who don’t have access to the whole page (in particular its head) to nevertheless add (meta)data to their pages.
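
Here’s a minimal sketch of what that buys you, in Python using only the standard library. The notice markup, the gaz: vocabulary URI and the property names are invented for illustration, and the extractor handles only a tiny slice of RDFa (the about, property and content attributes), nothing like a full RDFa processor.

    from html.parser import HTMLParser

    # Hypothetical notice markup; the real Gazette vocabulary differs.
    page = """
    <p about="http://www.london-gazette.co.uk/notices/12345"
       xmlns:gaz="http://example.org/gazette#">
      Notice <span property="gaz:noticeNumber">12345</span>, published on
      <span property="gaz:publicationDate" content="2008-07-24">24 July 2008</span>.
    </p>
    """

    class TinyRDFaExtractor(HTMLParser):
        """Collects (subject, property, value) triples; CURIEs are kept
        unexpanded in this sketch."""

        def __init__(self):
            super().__init__()
            self.subject = None   # nearest about="..." seen so far
            self.pending = None   # property waiting for element text
            self.triples = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "about" in attrs:
                self.subject = attrs["about"]
            if "property" in attrs:
                if "content" in attrs:
                    # a content attribute supplies the value directly
                    self.triples.append(
                        (self.subject, attrs["property"], attrs["content"]))
                else:
                    # otherwise the element's text content is the value
                    self.pending = attrs["property"]

        def handle_data(self, data):
            if self.pending and data.strip():
                self.triples.append((self.subject, self.pending, data.strip()))
                self.pending = None

    extractor = TinyRDFaExtractor()
    extractor.feed(page)
    for triple in extractor.triples:
        print(triple)

The point is that the page stays an ordinary, human-readable HTML page; the data rides along inside it, and any consumer can pull the triples back out.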

One of Ian Hickson’s mails argued that authors are too lazy to add (meta)data to pages and too evil for that data to be trusted, so that the only way for applications to understand the information in a web page is natural language processing. I think there’s some truth to that, particularly in the general case, but there are cases where authors can be motivated, can be trusted (enough), and can be assisted by something less powerful than an artificial intelligence. In the case of the London Gazette:

  • The authors are being paid to add semantics to the web pages; it’s their job. They don’t have the option of being lazy.
  • The statements that we’re making within these pages can’t be added by just anyone. The data has a highly trustworthy provenance: the London Gazette is an official publication of the UK Government, and the information it holds has legal weight.
  • Some of the data is collected through forms in the first place, and thus moderately structured. We want to use RDFa simply to expose the structures we already know about.
  • We’re also looking at using relatively simple regular-expression and gazetteer-based processing to identify relevant structures in the unstructured notices that we’re dealing with. Because we only have to deal with a limited range of natural language, we can do this pretty effectively (there’s a toy illustration below). What we want is to do this processing once, on the server side, rather than having every client that accesses the pages do its own processing. So we want to expose the semantics that we’ve identified through natural language processing by embedding that information within the HTML page.
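
That kind of processing can be sketched in a few lines; the gazetteer, the notice text and the patterns below are invented for illustration, and the real pipeline is considerably more thorough.

    import re

    # Toy gazetteer and notice; the real system uses much larger
    # gazetteers and a wider battery of patterns.
    GAZETTEER = {"Basingstoke", "Winchester", "Southampton"}

    notice = ("The deceased, formerly of 14 High Street, Basingstoke, "
              "died on 12 March 2008.")

    # Dates in notices follow a small, predictable set of forms.
    DATE = re.compile(
        r"\b(\d{1,2}) (January|February|March|April|May|June|July|"
        r"August|September|October|November|December) (\d{4})\b")

    dates = DATE.findall(notice)
    places = [w for w in re.findall(r"[A-Z][a-z]+", notice) if w in GAZETTEER]

    print(dates)   # [('12', 'March', '2008')]
    print(places)  # ['Basingstoke']

Once identified, each date or place can be wrapped in an RDFa span like the ones sketched earlier, so the result of the processing travels with the page.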

So for us, RDFa is a really useful tool. Without it, or something like it, we won’t be able to use HTML5.

Of course I haven’t addressed whether the way RDFa embeds data is the right design for HTML5; hopefully I’ll have time to talk about that in a later post.


4 Comments

>It’s not useful or practical to add (meta)data to HTML
>pages. The vast majority of people writing web pages
>won’t add it

A related problem is this: what is their incentive? How does it improve their lives enough to be worth the trouble?

The answer to both issues is that we should forget about people adding metadata to web pages and concentrate on institutions with a market reason for adding metadata. When we check train, airplane, or movie times, the pages we see aren't created by people; they're created dynamically by some HTML-generation process. Modifying these processes to add some RDFa wouldn't be very difficult, and these institutions have a clear incentive to make this data more accessible to more agents: spreading the word even further about when their services are available can drive more business to them.
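
A sketch of that point: when the generation process already holds the structured data, emitting RDFa alongside the display text is a small change to the template. The timetable rows, URIs and tt: vocabulary below are all invented for illustration.

    # The generator already has the structured data, so adding RDFa
    # attributes to the template is cheap.
    ROW = ('<tr about="http://example.org/services/{id}"'
           ' xmlns:tt="http://example.org/timetable#">'
           '<td property="tt:origin">{origin}</td>'
           '<td property="tt:destination">{destination}</td>'
           '<td property="tt:departs" content="{departs}">{display}</td></tr>')

    services = [
        {"id": "1A23", "origin": "London Paddington", "destination": "Reading",
         "departs": "2008-09-04T09:15:00", "display": "09:15"},
        {"id": "1A29", "origin": "London Paddington", "destination": "Oxford",
         "departs": "2008-09-04T09:45:00", "display": "09:45"},
    ]

    for service in services:
        print(ROW.format(**service))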

Of course, the UK government is another institution, and it's great to see them leading the way here.

Bob

"we should forget about people adding metadata to web pages and concentrate on institutions with a market reason for adding metadata"

I think that's exactly right. As a UK Civil Servant working on how we make government's information easier for people to re-use, I believe RDFa has much promise (as does GRDDL). It gives us a route to enable our data for re-use stepwise, in an incremental way - adding (meta)data bit by bit. The work with the London Gazette is a template to show how this can be done by others in government, and by other governments.

The incentive for governments, of course, is that making public sector information more widely available for re-use spurs innovation, for both economic and social value. We just need to find some pragmatic (and relatively inexpensive) ways of going about it. RDFa is a good approach in that respect.

And why does it matter what the government does? Well, in the UK the government is one of the largest primary producers of information, and the same is true of many other countries. Mapping, statistics, company information, medical information, environmental data: the list goes on and on. Yet only a fraction of the information the government could make available is actually published as re-usable data on the web. That's not just true in the UK, but everywhere.

RDFa offers the governments of the world a way of adapting existing database-driven sites to serve up data, in a way that fits with how the public sector generally serves content to the web. That's an important use case, and one that RDFa satisfies well.

While it may be unreasonable to expect average content authors to add good metadata to their content, it may not be unreasonable to expect next-generation authoring tools (common blogging platforms and CMSs) to automate some of this.

I can envision a future in which authors could make use of the RDFa about attribute to, say, explicitly link a movie review in a blog post to an IMDB entry, or something of the sort. Or, failing that, a future authoring platform could be intelligent enough to make that connection itself.
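
As a purely illustrative sketch (the helper function, the ex: vocabulary and the markup are my own invention, not any real platform's output), such a tool might emit something like:

    # A hypothetical helper a blogging platform might provide.
    def review_markup(title, imdb_url, rating):
        """Wrap a short review in RDFa so the statements are tied, via the
        about attribute, to the film's IMDB entry, not to the blog page."""
        return (f'<p about="{imdb_url}" xmlns:ex="http://example.org/review#">'
                f'I give <span property="ex:title">{title}</span> '
                f'<span property="ex:rating">{rating}</span> stars out of 5.</p>')

    print(review_markup("Back to the Future",
                        "http://www.imdb.com/title/tt0088763/", 4))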

Right. One addition, though: trusted metadata only works with trusted players. Most institutional providers – governments, but also most publishers and libraries, to name only a few – will both be able to provide high-quality metadata relatively easily (at least in some settings) and often have an incentive to do so.

Unfortunately, there'll also always be numerous players who are just as motivated to provide misleading metadata to attract traffic to their illicit offerings. Until we find solid mechanisms to automatically filter out those illegitimate providers, we'll need webs of trust among dependable providers to reliably aggregate metadata. This is also one of the assumptions of the syndication protocols developed for semantic descriptions by the eGov-Share Workshop (http://www.egovpt.org/fg/Working_area).
