Climategate and XML

By Rick Jelliffe
December 3, 2009 | Comments: 9

One interesting artifact to come out of the stolen Climategate material is an epic file, HARRY_READ_ME.txt. It seems to be a year-long log by a programmer (Harry?) who has to port old data and various old FORTRAN (and MATLAB?) programs to a new system and compiler, to allow new data series to be added. The old programmers are not available to answer questions.

Anyone who has worked in a research environment, or has had to port undocumented code that was perhaps hacked together, will sympathize with the kinds of problems he faces. Let alone 1980s FORTRAN 77 (if that is what it is). Harry frequently makes little pleas that he should simply have rewritten the software from scratch. The software is complicated because the job is too big to do in simple stages: there is not enough room on their 100 GB disks for intermediate data. One memorable bug turns out to be an infinite loop, triggered by a zero in the data from the South Pole.

But what I find most interesting in the log is how many of the problems that Harry feels strongly enough to write down come down, at the end of the day, to having to use untagged data. There is nowhere in the data files to clearly state what the fields mean: he has to reverse-engineer the meaning in some cases. He has a persistent problem that files from different regions follow different conventions, and multiple notations are used (0 to 360 degrees versus -180 to 180 degrees, for example).

Now that disk space is cheaper, it seems to me that XML's over-tagged design makes a lot of sense for long-term maintenance of scientific data. Extra fields/elements can be added, but everything is labelled. Data integrity can be verified using schemas. Headers or attributes can have the units clearly stated. Attributes can be used to explicitly mark up where values have been corrected or estimated, or where they are regarded as anomalous, perhaps with the old value being kept in place.
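To make that concrete, here is a sketch of what a tagged monthly temperature record might look like. Everything in it, from the element names to the values, is invented for illustration; it is not any actual CRU format:

    <station id="SP-01" name="South Pole" lat="-90.0" lon="0.0">
      <series variable="mean-temperature" units="degC" missing-value="-9999">
        <value year="1998" month="07">-59.7</value>
        <!-- an estimated value is labelled as such, with the method noted -->
        <value year="1998" month="08" status="estimated" method="interpolated">-60.1</value>
        <!-- a corrected value keeps the old reading alongside the new one -->
        <value year="1998" month="09" status="corrected" original="-6.2">-62.0</value>
      </series>
    </station>

Every field is labelled, the units and the missing-value sentinel are stated once in the header, and the corrected reading carries its original value with it, so the next Harry does not have to reverse-engineer any of it.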

The kinds of issues that SGML and XML sought to address, in particular rigorous markup, are not only there to help the publishing world escape the nightmares of maintaining troff, TeX and so on with, for example, Perl; I think they can also help the scientific world escape what we can see here to be a real maintenance problem: untagged text and binary data files processed with, for example, FORTRAN.

Post-XML, people have much stronger expectations that they should be able to use a data file immediately, from inspection alone. XML provides syntax to enforce regularity, which otherwise must be achieved through discipline. Sure, XML is verbose: but I suspect Harry's log would contain much less frustration if XML had been used (which it could not have been, back then).

Public data should be in standard public formats, using rigorous markup. The industrial/technical publishing world had similar issues with large uncontrolled datasets, variant formats, and the particular problem of needing data that could still be used even after a hiatus of a few decades, and the ultimate solution (XML and schemas) looks like a pretty good match for science datasets too, at the cost of larger file sizes (suggesting a standard compressed XML might be most appropriate).

[Update: By recommending XML over plain text or binary here, I didn't mean to exclude JSON: it has many properties that may make it even preferable to XML for scientific data storage, providing that there are structures in a header to hold the appropriate documentation of fields (datatype, units, range, precision, etc.) and appropriate metadata (e.g. Dublin Core metadata).]
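The same record could be sketched in JSON along those lines, again with names invented purely for illustration, putting the documentation of the fields and the Dublin Core-style metadata in a header object:

    {
      "metadata": {
        "dc:title": "Monthly mean temperature, example station",
        "dc:creator": "Example Research Unit",
        "dc:date": "2009-12-03"
      },
      "fields": {
        "year":  { "datatype": "integer", "range": [1850, 2009] },
        "month": { "datatype": "integer", "range": [1, 12] },
        "temp":  { "datatype": "decimal", "units": "degC", "precision": 0.1, "missingValue": -9999 }
      },
      "records": [
        { "year": 1998, "month": 7, "temp": -59.7 },
        { "year": 1998, "month": 8, "temp": -60.1, "status": "estimated" }
      ]
    }

A reader opening such a file decades later at least knows what each field means, what units it is in, and which values were estimated rather than measured.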



9 Comments

They're keeping this much data around, and they don't have a serious data repository to keep it in? When talking about research-scale data, 100 GB is small. I'm currently writing and running code that involves computing many trillion similarities, and then throwing around multiple copies of numerous chunks of data in groups of files that each hit dozens of gigabytes (most of the similarities aren't remembered, but quite a few are), and I barely even qualify as a blip on the filesystems I'm working on, available through a medium-sized supercomputer. Even my small research group (a half dozen programmers & doctoral students under a professor) by itself has a couple dozen terabytes of usable disk space lying around. What with how cheap large storage boxes using commodity drives are nowadays (and for the past few years), there's no excuse for a research group where data is a major part of the mission not to have the storage capacity to handle such things.

Russell: Well perhaps you should get some of your funding and hardware and send it to the CRU, if their work is more important than yours? Just kidding :-)

The quote from Harry is:

This took longer than hoped.. running out of disk space again. This is why Tim didn't save more of the intermediate products - which would have made my detective work easier. The ridiculous process he adopted - and which we have dutifully followed - creates hundreds of intermediate files at every stage, none of which are automatically zipped/unzipped. Crazy. I've filled a 100gb disk!

This is transient, intermediate data, not the data you would keep in a data repository anyway, is it?

The programs being ported were mainly F77 programs from the days when hardware was much more expensive, and the system was organized around small disk space (lots of files) and less CPU power (no compression). Neither of those constraints holds today, and this seems to be a factor in Harry's comments that he would have been better off rewriting the whole thing.

Not everyone is on the crest of Moore's law.

That his working system only had a 100 GB disk (or partition) is certainly interesting though. It suggests that the CRU has long suffered from funding or procurement difficulties.

I wrote in another blog post that it would be great if CRU got funding grants (especially from climate sceptics!) to allow their data sets and programs to be refactored, rewritten, re-debugged, and repackaged by independent software engineers for open source distribution. That would be a win for both sides.

In fact, what would be great would be a competition, where every year the group (including CRU) which provides the best open source climate model (for modeling the historic and current data) gets a cash prize. Any millionaire want to put up a million to set this up and a million a year for the prize?

This graph showing solar activity is reversed left to right; otherwise most people would recognize the "hockey stick".

http://en.wikipedia.org/wiki/File:Carbon14_with_activity_labels.svg

Based on this graph, it could be "scientifically" argued that C14, not CO2, is what is causing global warming.

Greg: The CO2 levels we have now have not been seen for 15 million years; they are almost double the average for the last few hundred thousand years, and about one third more than the peak of the various ice ages.

The so-called hockey stick is how sharp the increase looks on a long time-scale. But all that it means is that the increase has occurred mainly in the last fifty years.

To get some perspective on 15 million years: man (or any hominid) has never had to live in a world with these levels (the dismissal by some geologists, on the grounds that the earth has had these levels before, ignores that humans have never had these levels before, it seems to me). Fifteen million years ago the great apes and lesser apes were splitting apart (the great apes being the line that ultimately includes humans), and the closest primates may still have been along the lines of Proconsul.

That there would be natural components of climate warming, no-one denies. That there would be other effects at work, no-one denies (for example, the ozone hole has various heating and cooling effects).

About 0.038% of the atmosphere is carbon-bearing molecules, almost all of it CO2. Only 0.0000000001% of the carbon in the atmosphere (one trillionth) is Carbon-14, according to Wikipedia. The CO2 has steadily risen by 25% over the last 50 years, if my understanding of the numbers is correct.

For a guide to the CO2 analysis, see here. (The only point I can see that could work against AGW would be the idea that there are other counter-effects, as yet unknown, that will act to balance the human inputs of CO2. But the fact that we don't know them means that they are currently speculation, not science.)

To add a bit to your comments, I'd like to further point out that now that EXI (http://www.w3.org/TR/exi/) is a breath away from CR, we can have our cake and eat it too. Data can be stored in compact binary form, still be faster to process than textual XML (and, so long as advanced compression isn't used, competitive with hand-crafted binary formats), and still have rigorous labelling.

We certainly had a lot of interest from the research community in this, and I sure look forward to EXI being used for scientific data. Indeed, even with huge storage space, parsing terabytes of textual XML is going to be slow, but that can now be eliminated as an issue as well.

Readers: I see that someone on Reddit has a comment on this blog entry:

"No, Harry's real problem is that the fields and formats are undocumented. Blithely arguing "xml is self-documenting" misses the point.

Any reader can see that I never say anything like "XML is self-documenting". Here is my key passage, repeated for emphasis:

"Now that disk space is cheaper, it seems to me that XML's over-tagged design makes a lot of sense for long-term maintenance of scientific data. Extra fields/elements can be added, but everything is labelled. Data integrity can be verified using schemas. Headers or attributes can have the units clearly stated. Attributes can be used to explicitly mark up where values have been corrected or estimated, or where they are regarded as anomalous, perhaps with the old value being kept in place.

It is all very well to speak in generalities that things should be better documented, but, with data formats, unless there are syntactic requirements for labelling and affordances for easy extensibility, it won't happen. XML seems to fit the bill in this regard.
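For example (to pick just one schema language, and reusing the invented element names from the record sketched earlier in the post), a small Schematron schema could turn the labelling requirements and the coordinate convention into machine-checkable assertions rather than conventions to be remembered:

    <schema xmlns="http://purl.oclc.org/dsdl/schematron">
      <pattern>
        <rule context="station">
          <!-- coordinates must use a single agreed convention -->
          <assert test="@lat &gt;= -90 and @lat &lt;= 90">Latitude must be between -90 and 90 degrees.</assert>
          <assert test="@lon &gt;= -180 and @lon &lt;= 180">Longitude must use the -180 to 180 convention.</assert>
        </rule>
        <rule context="series">
          <!-- every data series must say what it measures and in what units -->
          <assert test="@variable and @units">Every series must state its variable and its units.</assert>
        </rule>
      </pattern>
    </schema>

Whether the schema language is W3C XML Schema, RELAX NG or Schematron matters less than the fact that the rules travel with the data and can be run mechanically.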

"About .038% of the atmosphere is carbon molecules, almost all of it C02. Only 0.0000000001% of the carbon in the atmosphere (one trillionth) is Carbon14, according to wikipedia. The CO2 has steadily risen by 25% over the last 50 years, if my understanding of the numbers is correct."

This is why you nuts base your arguments on Gregorian data that more closely resembles scripture than science and takes a true leap of faith to actually believe. Get off your pop-culture science throne and base your findings on accurate data; there are a multitude of flaws with the reasoning and methodologies involved in gathering climate data. In all of these studies the research was one-sided and set out to justify claims that we are melting our planet.

Yes, pollution is bad, but turning science into propaganda is treason against humanity to a much higher degree.

Jeremy: I don't understand your comment at all. I referenced studies based on the record going back 15 million years, so I don't see what your comment on "Gregorian data" means. To say that "all cases" of studies which disagree with you are based on some nasty agenda looks like blinkered intolerance: how are the shells lying, for example?

Bilgi: CO2? Nothing to do with XML: it was a digression.

My blog says "But what I find most interesting in the log is how many of the problems that Harry feels strongly enough to write down come down, at the end of the day, to having to use untagged data. ... XML's over-tagged design makes a lot of sense for long-term maintenance of scientific data."

You see, the problem is that when I point out that there are neutral technical reasons that contribute to the frustration of the programmer, and that it would be good to overcome these technical problems, I am not fitting in with the allowed narrative of the climate-change deniers.

We are only allowed to say that the science must be mistrusted, the scientists must be mistrusted, the governments must be mistrusted. So of course, if I make a practical comment which is neutral about climate change, sooner or later a reader will try to "balance" the deficiency, in this case with off-topic homemade theories about CO2.
