Linking a public government dataset into the semantic web with RDF

Australian Pharmaceutical Benefits Scheme

By Rick Jelliffe
September 27, 2009 | Comments: 2

Over the years, every time I have looked at RDF the promise seemed to eclipse the reality. Some things have improved: my blog post "RDFa: not the flopperooni one would expect?" is still pretty much my position.

A few months ago (mid 2009), a client wanted to dip their toes in the semantic web. So I took a fresh look at the status quo, and where the current sweet spot is.

I actually felt quite optimistic, because of positive comments by sensible XML industry folk like Jeni Tennison and Bob DuCharme. This is not to imply that the academic or research RDF community are not sensible, but their priorities (or, at least, their deadlines) have not been industry's (or, at least, not my industry's). Nor do they need to be.

The client is the Australian Government Department of Health and Ageing's Pharmaceutical Benefits Scheme website. In Australia, the government negotiates with drug companies for a large range of medicines: in return for a medicine being on the list for a particular treatment code for a minimum of two years, the government gets a reasonable discount from the pharma companies, and it subsidizes drugs and promotes the availability of generics. During my ill-health over the last few years, I don't think I had to pay more than maybe AU$32 (US$25) for any course of pills, and often half that. The scheme is available to all citizens (and residents); there are extra discounts and availability for some needy groups.

The system we made for them publishes a new list every month: this includes actual Legislative Instruments, various books, various electronic forms for repurposing, and a website. The website is really five alternate presentations of the same data (for consumers, for industry, for health professionals, who are the largest users, plus two PDA formats). It gets about three million hits a month. A particular angle is that pieces of the information are subject to regulation, from the formats to the data to the process to the deadlines.

So coming at RDF from the project angle, I first looked at whether it was feasible or useful to represent the data in RDF, for detailed knowledge representation. I decided it was not, for multiple reasons. The main reason was that whenever I followed a trail I ended up having to make up our own semantics. It seemed to add no value compared to having Plain Old XML (POX).

Disappointed, I looked at using RDFa inside HTML pages, for more modest representation of just parts of the information. The client was keen on using the hProduct system that Google has announced support for. But a couple of things disappointed: the first was that there seemed to be no benefit to RDFa (in this case) compared to using the terse microformat syntax (which we implemented); the second was that RDFa did not seem to be adequately defined over XML, and I did not want to perpetrate any vacuous pioneering on our esteemed client. The W3C Semantic Web Life Sciences mailing list was typically friendly and helpful in helping me figure out where the goal posts lie, and I certainly recommend them.

So the more I looked, the more it seemed that the kind of RDF information that was suitable was, in effect, little different from that in ISO Topic Maps: linking pages or sections to topics.

Indeed, that is obviously the conclusion that the W3C is itself pushing. Forget the knowledge representation for now, when you think of RDF think of Linked Data. I was so pleased to re-read Tim Berners-Lee's page on this; it made me a lot more confident that I was not being suspicious or incompetent. (...Pause for readership guffaws...)

So you can see the kind of approach I ended up taking in these files:

  • substance.rdf is a list of the substances as linked data
  • atc.rdf is a list of the Anatomical Therapeutic Chemical (ATC) codes as linked data
  • items.rdf is a list of some categorized items

When doing RDF or Topic Maps you first have to try to locate the various URLs for concepts or topics or the objects you are interested in. (If you are a Thomist, perhaps you can think of it as having to proceed from the more to the less well-known.)

I think this is quite important: the essential characteristic of the process is not describing what your data is, in some data modeling sense, but linking to other resources on the WWW in ways that allow programs to draw conclusions.

Two resources I found essential were:

  • DBpedia, which is an RDF-ified version of Wikipedia. It has a lot of URLs for things.

  • Life Sciences IDs. These are a simple URN mechanism for allocating IDs to sciency things. The LSID project provides various kinds of resolvers, but I used LSIDs merely as an identifier.
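Using an LSID "merely as an identifier" means treating it as an opaque but structured string. As a minimal sketch, assuming the usual LSID layout of urn:lsid:authority:namespace:object with an optional trailing revision, a program can split one into its parts without contacting any resolver. The example.org identifier in the usage note is hypothetical and is not one of the IDs used in the PBS data.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(urn: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its fields."""
    parts = urn.split(":")
    # The "urn" and "lsid" prefixes are case-insensitive.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {urn!r}")
    # Authority, namespace and object are all required to be non-empty.
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)
```

For instance, parse_lsid("urn:lsid:example.org:substance:12345") gives an authority of "example.org", a namespace of "substance", an object of "12345" and no revision.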

The approach I ended up taking was the simplest thing I thought could work and was meaningful: an RDF entry for each topic (drug etc.) linking to each HTML or XML page in the PBS website and also to one or more well-known descriptions of the same. For the ATC codes, it was easy because there is a WHO website giving the details. For the drugs, the DrugBank site was handy.
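To make the shape of such an entry concrete, here is a minimal sketch in Python that builds one RDF/XML entry per topic using only the standard library. The choice of properties is an assumption, not taken from the delivered files: rdfs:seeAlso for links to pages, and owl:sameAs for links to an external identifier of the same thing. The example.org URLs are hypothetical placeholders, and the DBpedia URI is only illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Ask ElementTree to serialize with the conventional prefixes.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("owl", OWL)):
    ET.register_namespace(prefix, uri)


def topic_entry(about, label, pages, same_as):
    """Build one rdf:Description for a topic (hypothetical helper).

    pages   -- URLs of pages about the topic (assumed: rdfs:seeAlso)
    same_as -- external URIs for the same thing (assumed: owl:sameAs)
    """
    entry = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": about})
    ET.SubElement(entry, f"{{{RDFS}}}label").text = label
    for url in pages:
        ET.SubElement(entry, f"{{{RDFS}}}seeAlso", {f"{{{RDF}}}resource": url})
    for uri in same_as:
        ET.SubElement(entry, f"{{{OWL}}}sameAs", {f"{{{RDF}}}resource": uri})
    return entry


def build_document(entries):
    """Wrap the entries in an rdf:RDF root and serialize to a string."""
    root = ET.Element(f"{{{RDF}}}RDF")
    root.extend(entries)
    return ET.tostring(root, encoding="unicode")


example = topic_entry(
    about="http://example.org/pbs/substance/abacavir-sulfate",  # hypothetical
    label="ABACAVIR SULFATE",
    pages=["http://example.org/pbs/consumer/abacavir-sulfate.html"],  # hypothetical
    same_as=["http://dbpedia.org/resource/Abacavir"],  # illustrative only
)
print(build_document([example]))
```

Keeping the two kinds of link in separate parameters reflects the distinction in the text between pages on the publisher's own site and well-known descriptions elsewhere; a real deployment might well pick different properties for either.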

It really emphasized to me that the kind of approach in PRESTO is absolutely vital. Everything significant needs a good stable URL, otherwise we cannot link to it.


As far as information modeling goes, the bare minimum was simply to make a single RDF entry for each kind of topic, and label it as a concept. That ties all the items together.

Here is an example diagram for substances (slightly changed):

And here is the kind of RDF we deliver:

<rdf:RDF xmlns:rdf=""
  <rdf:Description rdf:about="">
      <rdfs:comment>A molecule or drug</rdfs:comment>
      <rdfs:comment>This is our taxonomy! 
      There is a concept Substance</rdfs:comment>
            rdf:resource=" "/>

<rdfs:label>ABACAVIR SULFATE</rdfs:label>

My conclusion is that using RDF well within its limits is actually very convenient. And, more importantly, the linked data approach is not bogus: by which I mean that the use of the RDF adds a semantic that is not present otherwise and that would be useful for processing.
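One way to picture the added semantic: when two independently published datasets both link their own entries to the same well-known URI, a program can conclude that the entries are about the same thing without either publisher knowing of the other. The following Python sketch shows that inference over plain (subject, predicate, object) tuples. It assumes the links are expressed with owl:sameAs, and every subject URI in it is hypothetical.

```python
from collections import defaultdict

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def group_by_shared_link(triples):
    """Return {external URI: set of subjects} for URIs linked from 2+ subjects."""
    groups = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            groups[obj].add(subject)
    # Only URIs that tie together more than one subject tell us anything new.
    return {uri: subjects for uri, subjects in groups.items() if len(subjects) > 1}


# Hypothetical triples from two unrelated publishers.
pbs_triples = [
    ("http://example.org/pbs/substance/abacavir-sulfate", SAME_AS,
     "http://dbpedia.org/resource/Abacavir"),
]
other_triples = [
    ("http://example.net/formulary/drug/42", SAME_AS,
     "http://dbpedia.org/resource/Abacavir"),
    ("http://example.net/formulary/drug/43", SAME_AS,
     "http://dbpedia.org/resource/Aspirin"),
]

matches = group_by_shared_link(pbs_triples + other_triples)
for uri, subjects in matches.items():
    print(uri, "is shared by", sorted(subjects))
```

With private or custom identifiers only, no such join is possible; the shared URI is what carries the meaning across datasets.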

I regard the RDF in early versions of, say, RSS as completely bogus in contrast, and it soured me on RDF: bogus because all the RDF paraphernalia didn't seem to result in any added value, since all the semantics were private or custom anyway. Which is why RSS could move away from RDF without it altering anything real. I just mean that RDF is a poor facade but a strong foundation.

I think this kind of linked data approach is the egg that the chicken needs. Whether it will be a scrawny bantam or a rampaging Bruhathkayosaurus (err did they rampage?) will be told by time.

But unless you are in the lucky position of already having convenient types in the RDF sense defined by some standard or well-known web page, I don't think data modeling/knowledge representation in RDF is at a stage yet where any arbitrary technical data set will be happily or fully and usefully represented by RDF.

I think there is a cultural/technical/expectation gap at work too: it seems that RDF people expect that an element or attribute name in some XML vocabulary is something that can be disconnected from its context and used as an identifier in RDF expressions. I don't think that is remotely the way that a lot of XML works: the generic identifier or attribute name is interpreted in context for its specific semantics and may not be very interesting by itself.

(I would not be surprised if there are better ways of doing some of the RDF, of course. I'm happy to take advice. But they need to be more discoverable on the WWW in order to make it down to my level!)



A reader commented (and was deleted by filtering, sorry):

"Programming languages may have syntax and sort of vocabulary but expecting to find any semantics in expressions is ridiculous."

Yes, I agree that computers only deal in symbols and numbers as primitives, so any problem at all involves reducing semantics to symbols and numbers and functions on them.

The "semantics" in semantic web involves attempts to give symbols to things that relate to the things themselves, rather than being concerned with the discourse or rendering of them in a medium: so non-semantic is "<bold>hello</bold> <italic>world</italic>" while grammatical is "<greeting><salutation>hello</salutation><target>world</target></greeting>".

In semantic markup (linked data), you would be more interested in "<a href=""><a href="">hello</a> <a href="">world</a></a>"

In full semantic markup, you might be interested in combining all these things: the point of the semantic web is not so much the semantic, but the web: can URIs allow us to create a much richer kind of markup?
