Does an 'open' format provide the benefits it is supposed to?

By Rick Jelliffe
March 22, 2009 | Comments: 11

One thing that struck me when reading Fort Worth legislator pushes for open format in state documents (hat-tip to Nick Carr...the other one) was the seamless transition in the article from open standard to open source as if the one necessitated the other. Would that it were that simple!

I don't see that the current open standards for office document formats automatically provide all the particular advantages that people seem to think they do. It is of course better that the format information is published, that the IP has been sorted out, that it has been QAed by organizations independent of its champions, and that there are processes in place which, if taken advantage of, can allow broader future participation by users and other stakeholders.

But the information in those kinds of documents is basically the same as the old .RTF formats. And these were already widely implemented, but the wide implementation had not favoured open source particularly well, up to that stage. (And, I support the RTF being de-proprietorized by making a standard description of it too, by the way...)

Now I support all government/QANGO public information on public sites being available in open formats, and in a variety of open formats with HTML, PDF (if that makes the cut), ODF and DAISY in particular.

But open formats are not pixie dust that necessarily change the game, though they may give the appearance of change necessary to agitate the market (which is not necessarily a bad thing.)

So what would be a game-changer? Slightly more semantic markup. In the case of legislation, this means strictly adhering to a cohesive set of styles, where the styles are based on a common pan-jurisdiction style catalog of some kind, and where there is even the most basic QA mechanism in place to make sure that the styles are being adhered to. The trade jargon for this is rigorous markup.
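To make "the most basic QA mechanism" concrete, here is a minimal sketch in Python. It assumes a document has already been reduced to a list of (style name, text) pairs, and the catalog names are invented for illustration; they are not drawn from any real jurisdiction's style catalog.

```python
# Hypothetical style catalog: these names are illustrative assumptions only.
STYLE_CATALOG = {"act-title", "section-heading", "section-body", "case-cit", "definition"}


def find_style_violations(paragraphs, catalog=STYLE_CATALOG):
    """Return (index, style) pairs for paragraphs whose style is missing
    or is not one of the names in the agreed catalog."""
    violations = []
    for index, (style, _text) in enumerate(paragraphs):
        if style is None or style not in catalog:
            violations.append((index, style))
    return violations


# A toy document: (style name, text) pairs. None stands for a paragraph
# that carries only direct formatting and no named style.
document = [
    ("act-title", "Sale of Goods Act"),
    ("section-heading", "1. Interpretation"),
    (None, "Some text with only direct formatting"),
    ("Heading 1 (custom)", "2. Contracts"),
]

print(find_style_violations(document))
# prints [(2, None), (3, 'Heading 1 (custom)')]
```

A check this small is enough to tell whether a body of documents actually uses the agreed styles, whatever file format they are stored in.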

When you have styles, and slightly more semantic information, you can use the data in quite different kinds of systems and workflows. Changing RTF's {\i Carlill v. Carbolic Smoke Ball Co.} to HTML/ODF <i>Carlill v. Carbolic Smoke Ball Co.</i> or OOXML's <r><rPr><i/></rPr><t>Carlill v. Carbolic Smoke Ball Co.</t></r> is very nice, but the breakthrough comes when you have <i class="case-cit">Carlill v. Carbolic Smoke Ball Co.</i> or the XML-ish <cite ref="1893 1 QB 256">Carlill v. Carbolic Smoke Ball Co.</cite>
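As a rough illustration of what the extra class buys you, the sketch below uses Python's standard html.parser module to pull case citations out of HTML. The class name case-cit comes from the example above; the CaseCitationExtractor name and the sample sentence are assumptions made for this sketch, and it assumes well-nested markup with no void elements inside a citation.

```python
from html.parser import HTMLParser


class CaseCitationExtractor(HTMLParser):
    """Collect the text of every element whose class list includes 'case-cit'."""

    def __init__(self):
        super().__init__()
        self.citations = []
        self._depth = 0    # greater than 0 while inside a case-cit element
        self._buffer = []  # text fragments of the citation being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested element inside a citation: track it so we know when we leave.
            self._depth += 1
        elif "case-cit" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.citations.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


html = (
    '<p>See <i class="case-cit">Carlill v. Carbolic Smoke Ball Co.</i>, '
    "not <i>obiter</i> remarks.</p>"
)
extractor = CaseCitationExtractor()
extractor.feed(html)
extractor.close()
print(extractor.citations)
# prints ['Carlill v. Carbolic Smoke Ball Co.']
```

With plain <i> alone, the citation and the merely italic word are indistinguishable to a program; with the class, a citation index can be built without guessing.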

Legislators interested in this area should, as part of governance in adopting open formats, check whether there has been a common set of appropriate semantic styles developed and adopted, and what QA is in place to verify that in fact the styles are in use in a way that allows game-changing systems to be introduced.



11 Comments

Your argument seems to be that a Sumerian cuneiform encoding of a cipher written by a jargon-obsessed scribe with a large patent portfolio and legislation prohibiting reverse engineering is just as good as a cleanly written text encoded in an open standard format, since "the information in those kinds of documents is basically the same".

It isn't really a question about the nature of the information itself, but the ability to access that information, by users and those whom they choose to share that information with. Open standards are about freedom of access.

History has taught us that proprietary formats can be, and indeed have been, abused to lock users into particular software applications. Were you not aware of this? This open access aspect of using open document formats is, in itself, a game changer.

In any case it sounds like you've independently discovered the semantic web. Some guy over at the W3C -- Tim, Jim, something like that -- has done some work on it as well. You might want to check it out ;-)

C'mon Rob, Rick and the community he spawned in were talking semantic markup before Berners-Lee wrote his first paper on simple hyperlinking. A simple class citation with a domain list for the values in it (names don't do it alone) can go a long way toward sorting out the intension of the author.

As to lock-in, Sun is still sorting out the Java imbroglio while fire-saling the corporation because a commitment to open-everything is a suicide pact with the customers. You can't stay in business giving away gold and selling the boxes it comes in. Meanwhile Microsoft is still floating in cash.

Ten years in, no matter what we think about the fairness of it, Adobe and Microsoft stayed alive with proprietary software even when the formats are open. The only markets I see where there is some reasonable compromise are where the standards process is guaranteed, the IP is sorted in advance by participation agreements, and the standard has a reference model built into the standard itself.

And Assyrians were using determinatives (aka semagrams) as semantic tags 3000 years before Goldfarb had his baby teeth. That's my point. It is obvious and has absolutely nothing to do with the benefits of open standards. For all we know Linear A is written in brilliant "rigorous markup". But because the "format" has not been "reverse-engineered" yet we have zero idea what it means. Until you have open formats, your documents are hostage to your vendor for as long as they choose to support the products that read and write that format. When they die or move on, then your documents are equally dead. Ask Microsoft Works customers about that.

And have you looked at Microsoft's cash position lately? Compare what they have now versus 3 or even 5 years ago. "Floating" is not the first word that comes to mind.

Rob: Yes, I am certainly not claiming any originality for my point: it is widely held in my industry, based on years of experience in trying to make workable systems out of the sow's ears that vendors provide, all the time insisting that they magically know what we need.

However it is a point that needs to be made. "Rigorous markup" is the phrase in IS 8879:1986 (SGML) IIRC (certainly it is in Goldfarb's SGML Handbook). It is certainly not T.B-L's Semantic Web however.

Len: I think you are making a category error. When someone's only method of discourse is attack, and the only kind of discourse he perceives is also attack, then trying to get a meeting of the minds by clarifying your point is useless. The best you can do is hope that informed readers will be able to understand enough of both sides to figure out their own positions.

So the blather about Sumerians and Assyrians is not part of a reasoned argument; it is just a rhetorical flourish. Nor is the rehash of unrelated talking points (proprietary formats lock in?, Microsoft Works? pullease!): that is just a diversion to keep readers' minds circling in the allowed orbit.

My theory is that just as these suits only see things in terms of suite-developer skirmishes (the "elephant in the room"), the only 'game change' that can manage to come into their focus relates to advantaging or disadvantaging their company's current products and strategy.

So saying that a move to better labeling (which, in the case of office applications, means more rigorous use of pan-industry standard style names) is a necessary step towards a game-changed architecture won't make any sense to them: it does not compute.

Now of course, generic identification of information is in fact central to the old SGML agenda: but it is a user-side issue (in particular, an integrator issue, the integrators being the front-line of users.)

I wrote a blog this week about Dr Sefton's comments from his long-term and detailed experience, and I don't think I am saying anything more than he would say, concerning the centrality of rigorous use of styles as the driving force for enabling new office architectures.

The expression "game-changer" makes my eyes glaze over, especially when "pixie dust" is used in the same sentence. (Are you saying that semantic markup _is_ pixie dust?) The benefits of open formats are file reconstructability (my favorite buzzword this week) and not being locked in to specific software. Semantic markup also contributes to reconstructability, but only if the format can be read in the first place (which I think was the point of the references to Linear-A, etc., even if the rhetoric was overblown). The ability to mine information from a document is nice, but in what sense is it "game-changing"?

My phrase is: a picture is worth a thousand words only if you have a thousand words to trade.

From the genCodes to the format codes to the content codes, we've known the weakness of tagging is no amount of markup of any sort removes the need for interpretation or process. It just makes the coupling looser and that has benefits but they come at a price: negotiation or guessing.

You're right, Rick, about the category error. It isn't worth wasting your time with my response. I think I am distracting myself from immediacy. Round and round the Moebius loop the data chased the easel....

And now.... M!! Revenge of the Curly. :-)

Gary: If you have a list of common pithy expressions or cliches that don't make your "eyes glaze over", please send them and I will give them every consideration ;-)

Semantic web as pixie dust? I don't see where you get that: I didn't bring it up and only mentioned it to distinguish that rigorous markup ("slightly more semantic markup") is not the SW.

My point is simple: merely substituting one application for a similar application by a different vendor is not a disruptive change (except perhaps to the commercial interests of a couple of US companies, who are not particularly interesting from this side of the globe). "Not being locked in to specific software" is where we were with RTF, for normal, small office WP documents.

Being able to express simple presentation-oriented information in a new shiny angle-bracketed way does not really change things. Simple generic markup, even if just at the next level up from presentation, like HTML, allows applications that allow more sophisticated processing of the document.

This can work either by allowing more generic markup and more interesting styling (such as Word's SmartArt where generic list structures can be styled into various kinds of diagrams), or more specific markup, such as the case-citation example I gave.

Now this is not theoretical, it has been the insight that drove the SGML industry and which still largely drives high-end/high-value document processing. It works. These much, much smarter systems have been in place and working for up to two decades.

The challenge is to bring the kinds of capabilities of higher-level markup down to the consumer level, where the data has a lower value that cannot justify SGML-style processes and markup. (There are a lot of improvements, MathML and SVG, for example.)

The disruption occurs because new ways of presenting, processing and accessing the information become possible. For example, a spreadsheet which is also an XBRL document would change the game for financial governance.

Gary, that is my point exactly. Rick has merely "discovered" a property of language itself, a property which has existed since the dawn of recorded history, and tells us that it is a "game changer". Yawn.

Rick also fails to see the value of an open document exchange standard, probably the only one I know who suffers from lacuna amnesia in this area. It is not only about substitutability of applications. It is primarily about document exchange. Similarly, the value of convergence on TCP/IP as a networking protocol was not to aid network card manufacturers, though it certainly did that. The primary benefit was the increase in communications. The primary value of HTML was not to aid browser vendors. In fact almost no one makes money selling browsers today. The value of HTML is in content publishing.

In general, we need to look at the connections that an open standard brings, not just the nodes. If you miss that, you miss everything.

Rob: You use "discover" as if I am claiming some new insight. Yet I specifically acknowledged the idea was not new and had well-known trade jargon for it. Then you restate it with quotes, presumably not realizing you are in fact quoting yourself derogatorily: a one-man Chinese Whisper campaign!

You will undoubtedly be relieved to realize that you seem to have read my blog entirely incorrectly. You appear to have skipped over the words "automatically" and "necessarily" which make the blog into a suggestion of a possibility (a realistic one, I think) rather than a doctrinaire or absolutist denial.

If you don't know anyone who doesn't think the same way as you do, it may be a symptom of living in an echo chamber. Or maybe people respect you too much to contradict you, though I think frankness is sometimes more respectful? (Personally, I think different opinions are useful, and I am never surprised to find people who think differently from me, nor do I consider it a great defect in them: perhaps that is why I think broad representation is good, while you think limited representation with only people with affiliations like yours is all that is necessary in the ODF TC.)

Since you so kindly inquired as to my health, actually I did suffer severely from amnesia for a period in my early 30s: thankfully it wasn't the brain tumour the doctors scared me with. I still have it in mild form for some things: I usually cannot retain more than 4 digits in a sequence, for example.

As I understand it, lacuna amnesia is a dissociative amnesia, which relates to blocking out a single (presumably traumatic) event. I am not aware of having that, but presumably one wouldn't be.

As I understand it, you are saying that we need common styling in addition to common (and open) data formats. I would agree wholeheartedly.

I recently attempted to put together a family newsletter. Each contributor used whatever software he or she had (Word 2003/Windows, Word 2007/Windows, OpenOffice 3.0/Windows, OpenOffice 2.4/Linux) and presentational formatting.

The collision of fonts and formatting was informative. I ended up opening each file in OpenOffice/Linux and applying styles on the fly. I should note that certain styles in OOo 2.4.x fail to change the appearance of subsequent instances. So those parts had to have presentational styling added on top of the styling.

Tool support for exchanging pre-set styles is something that would mean a lot in those conditions. Such styling should be enforceable, as in "Family Newsletter" allowing only a certain set of styles and limited presentational (non-style) formatting outside of that list.

I would love to see such styles work across vendors' products and even file formats. Why should the end-user have to be concerned that the requisite styling for his company's documents/spreadsheets might not work for the people in department X, using a different OS or application than the rest of the company?

On the automated tools front, I am sure that each of us is aware of times when presentational formatting is used to convey additional information relevant to the content, but the tools fail to pick it up.

Obviously, file format interop is the basis for any of this. If your application's file format isn't useful for my application, how will I use the styling in my own documents? And openness is at the core of everything. The less risk that someone will assert an IP claim against me, the better the chance that I will attempt to find new ways to reuse content embedded in the application's files.


W^L wrote: As I understand it, you are saying that we need common styling in addition to common (and open) data formats.

Yes. And it is not a blue-sky idea: PowerPoint's "themes" and the "skins" used by many plug-in systems use advertised names that allow different styling. (I am not sure whether Impress' "Slide Design" is the same thing, but it may be.) It is a small change but a useful one.

On the comment "Obviously, file format interop is the basis for any of this. If your application's file format isn't useful for my application, how will I use the styling in my own documents?" I should point out that SGML/XML were based on allowing a radical separation of the document format from the style/theme system. This persists over at W3C, where you can use CSS for styling but also XSL.

I don't know whether there are clear economic/technical reasons to favour either a unitary style/theme system, or for each document system to grow their own, or whether just to all adopt a common one in addition to the local system (presumably it would be "everyone should support themes expressed as CSS styles".)
