One interesting artifact to come out of the stolen Climategate material is an epic file, HARRY_READ_ME.txt. It seems to be a year-long log by a programmer (Harry?) who has to port old data and various old FORTRAN (and MATLAB?) programs to a new system and compiler, to allow new data series to be added. The old programmers are not available to answer questions.
The kinds of problems he faces will draw sympathy from anyone who has worked in a research environment, or has had to port undocumented code that has perhaps been hacked together, let alone 1980s FORTRAN 77 (if that is what it is). Harry frequently makes little pleas that he should have just started rewriting the software from scratch. The software is complicated because the job is too big to do in simple stages: there is not enough room on their 100 GB disks for intermediate data. One great bug turns out to be an infinite loop, triggered by data from the South Pole with a value of 0.
But what I find most interesting in the log is how many of the problems that Harry feels strongly enough to write down come down, at the end of the day, to having to use untagged data. There is nowhere in the data files to state clearly what the data fields mean: he has to reverse-engineer the meaning in some cases. He has a persistent problem that files from different regions follow different conventions, with multiple notations in use (longitude given as 0 to 360 degrees or as -180 to 180 degrees, for example).
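To see the problem, consider an invented line of untagged station data (an illustration only, not a line from the actual files):

    705  -890  1998  12  -284

Is the first field a longitude in tenths of a degree on a 0 to 360 scale, or on a -180 to 180 scale, or something else entirely? Is the last field a temperature in tenths of a degree Celsius, or a missing-value flag? Nothing in the file says.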
Now that disk space is cheaper, it seems to me that XML's over-tagged design makes a lot of sense for long-term maintenance of scientific data. Extra fields/elements can be added, but everything is labelled. Data integrity can be verified using schemas. Headers or attributes can have the units clearly stated. Attributes can be used to explicitly mark up where values have been corrected or estimated, or where they are regarded as anomalous, perhaps with the old value being kept in place.
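As a sketch of what that could look like (the element and attribute names here are invented for illustration, not taken from any existing schema):

    <station id="012345" name="Example South Pole AWS"
             lat="-90.0" lon="0.0" lon-convention="-180to180">
      <record year="1998" month="12">
        <tmean unit="degC" precision="0.1">-28.4</tmean>
      </record>
      <record year="1999" month="01">
        <!-- corrected value, with the original kept alongside -->
        <tmean unit="degC" status="corrected" original-value="-99.9">-27.6</tmean>
      </record>
    </station>

Every field is labelled, the units and the longitude convention are explicit, and the correction is recorded rather than silently overwritten.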
The kinds of issues that SGML and XML sought to address, in particular rigorous markup, were not only meant to help the publishing world escape the nightmare of maintaining troff, TeX and so on with, for example, Perl; I think they can also help the scientific world escape what we can see here is a real maintenance problem with untagged text and binary data files and, for example, FORTRAN.
Post-XML, people have much stronger expectations that they should be able to use a data file immediately, from inspection. XML provides syntax to enforce regularity, which otherwise must be achieved through discipline. Sure, XML is verbose: but I suspect Harry's log would contain much less frustration if XML had been used (which it could not have been, back then).
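That regularity can be enforced mechanically. A fragment of a hypothetical W3C XML Schema, for instance, could insist that every tmean element carries a unit attribute (again, the names are illustrative):

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="tmean">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:decimal">
              <xs:attribute name="unit" type="xs:string" use="required"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:schema>

A validator would then reject a file with a missing unit or a non-numeric value before it ever reached the FORTRAN.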
Public data should be in standard public formats, using rigorous markup. The industrial and technical publishing world had similar issues with large uncontrolled datasets, variant formats, and the particular problem of needing data that could still be used even after a hiatus of a few decades, and the ultimate solution (XML and schemas) looks like a pretty good match for science datasets too, at the cost of larger file sizes (suggesting that a standard compressed XML might be most appropriate).
[Update: By recommending XML over plain text or binary here, I didn't mean to exclude JSON: it has many properties that may even make it preferable to XML for scientific data storage, provided that there are structures in a header to hold the appropriate documentation of fields (datatype, units, range, precision, etc.) and appropriate metadata (e.g. Dublin Core metadata).]
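Such a header might look something like this (a sketch only; the field names are invented, though the dc: terms are Dublin Core):

    {
      "metadata": {
        "dc:title": "Example monthly mean temperatures",
        "dc:creator": "Example Research Unit",
        "dc:date": "2009-12-01"
      },
      "fields": {
        "lon":   { "datatype": "number", "units": "degrees", "range": [-180, 180] },
        "lat":   { "datatype": "number", "units": "degrees", "range": [-90, 90] },
        "tmean": { "datatype": "number", "units": "degC", "precision": 0.1 }
      },
      "records": [
        { "lon": 0.0, "lat": -90.0, "year": 1998, "month": 12, "tmean": -28.4 }
      ]
    }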