Jon Bosak, who founded the XML and ODF efforts among many other achievements, recently wrote an article concerning the position of ODF, Open XML and PDF. Jon's public writings are rare, well-considered and always of interest. As with other Sun-affiliated people in recent times, Jon has been exemplary in that even though he has a side, he does not take sides. I think I can agree with much of what he says, though I would add: don't forget about HTML.
Jon's awareness of the different capabilities of different formats, and the centrality of having a clear understanding of these, stimulated me to try to articulate something I have been thinking about for a while: is ODF better considered a replacement for .DOC or .RTF, and can it be a substitute for both? And are there any bigger issues than this looming, that we should take stock of?
First, a potted history of the document format landscape over the last 25 years.
One Thousand Islands
In the above diagram, each of the bubbles represents numerous formats, mostly dependent on particular vendors, applications and platforms. But each served a different purpose:
- plain text was universally readable and writable (within a locale) and directly printable on even the simplest of output devices, but with poor capabilities;
- rich text was simple to generate by programs, and suitable for such things as documentation;
- the native binary formats of applications were intended to provide full fidelity, to match exactly the features of a particular version of the application, and to lock customers in to a vendor's application;
- markup languages allowed a network effect, where programming and text processing tools could be used in document production, and where they could be converted into the other kinds of formats; markup languages were particularly associated with UNIX systems (indeed, UNIX had been originally developed in order to support document processing using simple markup);
- page description languages were terminal or read-only formats, concentrating on putting characters or pixels on pages.
By the mid 1990s, this had changed, in a period of massive consolidation as the web, Microsoft and Adobe starved rivals of oxygen:
- Microsoft's .RTF format had become the universally supported rich text format (aided by the bundling of Windows with RTF-aware applications) and this led to RTF's emergence as the usual format for document interchange between different systems: lossy but workable;
- as Word increasingly became dominant, the .DOC format started to get wide support through assiduous reverse-engineering by Microsoft's competitors;
- piggybacking on the market dominance of PostScript and the availability of a free PDF browser, PDF pushed the other page description languages well away;
- the phenomenal rise of HTML, backed by the free browser, quickly superseded plain text documents;
- the Standard Generalized Markup Language (SGML) consolidated many divergent capabilities of existing markup languages. However, SGML's value proposition of supporting large, complex documents through efficient streaming processing did not fit into the areas of explosive growth: the market demanded WYSIWYG applications, which were based on in-memory access, and the slow speed of CPUs and high cost of RAM made WYSIWYG SGML for large documents an unworkable combination with the technology of the day.
One fairly unheralded aspect of this was the rise in the publishing community, including high-end vendors, of an awareness of the benefits of co-operation and standardization of different kinds. SGML was born out of such cooperation at ISO (formally ISO/IEC Joint Technical Committee 1 on Information Technology); however, vendors continued to cooperate in a more flexible consortium-like arena at the organization that later became OASIS Open, and many of the OASIS and ISO participants ultimately moved over to W3C to develop XML.
By the end of the 1990s, there had been a realignment of the rationales for the formats: in particular, HTML had moved from the dumb end of town into having its own compelling rationale. It had morphed from being merely a richer replacement for plain text to being a format for instant, dynamic, customizable documents that looked good on a wide variety of output renderers. Governments with a vital interest in accessibility (e.g. systems and documents that do not discriminate against citizens with special eyesight requirements) pricked up their ears.
Of these formats, Microsoft controlled .DOC, .RTF, and was actively involved at W3C in XML and HTML development. PDF was controlled by Adobe.
The early part of the 2000s was a period of extraordinary growth for applications of XML, matched by stagnation in the other areas.
The dynamism of the open standards was in marked contrast to the stagnation of the corporate-controlled formats. However, during this period, Microsoft adopted a policy of containment to prevent HTML from taking market share from .DOC or even .RTF systems: notoriously with an unwillingness to support CSS thoroughly, for example features such as generated text.
There is a clear sign of a vendor trying to contain a rival technology: they provide either import or export to a foreign format, but not both. The inability of word processors to import HTML adequately during this period is an example, but it continues as a good trick (at the user's expense) today: OpenOffice 3.0 imports OOXML but does not export it, for example.
The large vendors also spent a deal of effort in trying to get effective stories for the XUL (XML user interface) and animation areas (e.g. Flash), which remain non-standard.
In the late noughties, a period of standardization broke through, with all the major formats being taken by their stakeholders through industry consortia and on to ISO. (Yes, there is even an ISO HTML standard, which was made in the late 1990s in order to endorse but not hijack the W3C HTML effort: strictly, it is a profile that has certain structural requirements such as requiring that div elements start with headings.)
The enormous controversy over the standardization of OOXML comes largely down to, in my opinion, two different views of the emerging landscape.
Business as Usual
The first landscape (let's call it business as usual) sees ODF as superseding RTF: RTF-in-XML-in-ZIP. In other words, the primary role of RTF is for medium-quality document interchange between applications, without guaranteeing that any custom features of individual applications will necessarily survive (let alone have graceful degradation). This seems to be the angle that Microsoft is coming from, and the one that, I think, won out at ISO: that OOXML distinguishes itself from ODF by being DOC-in-XML-in-ZIP, with an emphasis on fidelity to a particular rich feature set rather than RTF-in-XML-in-ZIP, with an emphasis on interchange between systems with a basic feature set.
I think this is also a reason why a much harder standard was placed on completeness of documentation in IS29500 OOXML compared to the looseness of IS26300 ODF. For example, none of the compatibility settings in an ODF file generated by Open Office are documented in the ODF standard; however, the lack of adequate documentation of some compatibility setting in the draft of IS29500 was deemed to be a terrible flaw! I have heard some people complain that this comparative severity against DIS29500 and leniency towards IS26300 did not have a justifiable basis, in the non-discriminatory (some would say, undiscriminating) world of ISO/IEC JTC1. Without giving an inch that these flaws were indeed legitimate showstoppers for draft IS29500, I see nothing hypocritical in expecting IS29500 to provide this information while being less concerned about its absence in IS26300, if we indeed accept that ODF is primarily an RTF replacement. (I think the tolerance for an underspecification of ODF 1.0 in the name of interoperability was also a reason why the revelation that ODF used its own incompatible dialect of SVG in a different namespace caused so much disappointment to open-minded reviewers of DIS29500 (who were expecting to find OOXML unreservedly inferior to ODF): it undermined arguments against OOXML for "not following standards" and the dangers of embrace-and-extend.)
The Bermuda Triangle
The second landscape (let's call it the Bermuda triangle) sees ODF as replacing .DOC, with RTF becoming swallowed by both ODF and HTML. I think this is the landscape anticipated by people who don't see much value in IS29500 or multiple targeted formats.
This second approach has a lot to commend it: HTML really does supersede RTF as being the interchange format for simple rich text. But the idea that ODF made OOXML unnecessary made many document format experts aghast: RTF and DOC had co-existed for decades because they addressed different problems and served different requirements: interchange versus fidelity couldn't go away merely by wishful thinking.
Is this a sleight of hand? ODF pretty much exactly matches the feature set of Open Office/Symphony (it is Open Office's native/default format: Lotus Symphony grafts the Open Office application onto the Eclipse Rich Client Platform) but is still quite a way in excess of the capabilities of many other FOSS word processors. The danger of the Bermuda triangle is perhaps that in effect it redefines document interchange as being interchange with Open Office: features in excess of Open Office become non-standard (i.e. other native/default formats) and systems which support a lesser feature set become non-conforming (i.e. the simpler RTF systems).
My own view is that the Bermuda triangle would be preferable, and it may be where we end up, but that the only way there is to first standardize business as usual. From this base, there is a hope of moving forward, but only if ODF, HTML and even PDF are significantly improved, with proper vendor support:
- HTML needs to be developed further. There have been some positive signs: the HTML 5 effort has reanimated the corpse at W3C, Microsoft IE8 has better CSS support, and Microsoft Office has better MathML support. But Microsoft needs to commit to supporting fully the HTML/CSS/MathML/XForms/SVG Tiny architecture, and they need to re-engage with W3C efforts. Some W3C standards may need to get some accommodation as well: for example, is the SVG page geometry skewed towards interoperability with Adobe and skewed against interoperability with Microsoft? The W3C has never come to terms with compound documents and archives: "use MIME multi-part" is the kneejerk response, which utterly betrays the idea that all resources should be identifiable; the W3C needs to get serious about enhancing URLs to allow navigation inside ZIP and OPC packages.
- ODF (2.0) needs to be re-developed with a view to better supporting alternatives and graceful degradation. Requiring only one technology looks good on paper, but ignores the operation of the bazaar and the market: the danger of the current positioning of ODF is that it may occupy a no-man's-land, being slightly too complex for the open source word processors but slightly too simple to be adequate as a native format. This is not only an issue of supporting Word better, but also Word Perfect and other binaries. A particularly clear example of this issue is the challenge of the Chinese UOF format, which seems to be an early fork of ODF plus several specific Chinese typesetting features that have avoided incorporation into ODF, with Chinese language element names. A framework that is based on knowing which alternatives get outdated by editing (and so must be removed) would also allow the introduction of many useful innovations: consider ODF or OOXML with a special part that gives the page, column, cell and line starts for the current document in a particular application (linking to the text positions in the data contents), and allows these to be frozen for other applications: this would give much greater page or line-level fidelity between applications. Support for arbitrarily formatted XML foreign content should be more serious.
- PDF has three challengers, none particularly serious at the moment: Microsoft has its XPS format (another XML-in-ZIP), Adobe has some similar technology (MARS) and there is a mooted Chinese push for the OASIS UOML (Unstructured Objects Markup Language: a simple modern PDL) to be standardized. The various efforts to standardize profiles of PDF are well underway, and recent capabilities of PDF for minimal editability and structured PDF allow better integration. But the challenge is for PDF with embedded SVG, and for ODF (2.0) with embedded PDF alternatives, and so on.
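The framework suggested in the ODF (2.0) item above, where a package knows which alternatives are outdated by editing, might be sketched roughly as follows. Everything here is hypothetical: the `Package` and `Alternative` classes, the part name `layout-hints.xml`, and the use of a content hash to detect staleness are illustrative choices of mine, not anything defined by ODF or OOXML.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    """Fingerprint of the primary content an alternative was derived from."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Alternative:
    data: bytes
    source_digest: str  # digest of the primary content at derivation time


@dataclass
class Package:
    primary: bytes
    alternatives: dict = field(default_factory=dict)

    def add_alternative(self, name: str, data: bytes) -> None:
        # Record which version of the primary content this alternative matches.
        self.alternatives[name] = Alternative(data, digest(self.primary))

    def edit(self, new_primary: bytes) -> list:
        """Replace the primary content and drop alternatives made stale by the edit."""
        self.primary = new_primary
        current = digest(new_primary)
        stale = [name for name, alt in self.alternatives.items()
                 if alt.source_digest != current]
        for name in stale:
            del self.alternatives[name]
        return stale


pkg = Package(b"<doc>v1</doc>")
pkg.add_alternative("layout-hints.xml", b"<lines/>")  # e.g. frozen line starts
print(pkg.edit(b"<doc>v2</doc>"))  # names of the alternatives that were removed
```

An application that cannot regenerate an alternative simply loses it on edit, while one that can regenerate it would call `add_alternative` again: that is the graceful degradation being asked for.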
So what are common threads required if we are to move from business as usual to the Bermuda triangle?
- Support for a common next-generation document packaging (XML-in-ZIP) format to be assertively supported by W3C, ODF (2.0) and son-of-PDF. I think IS29500 Open Packaging Conventions provides a primary resource for this, but it needs to have more explicit support for plurality, modularity, and alternatives: graceful degradation and multiple formats.
- Ubiquitous support for unvarnished SVG Tiny, MathML, and XForms by W3C, ODF (2.0) and son-of-PDF. From the other direction, SVG may need to be updated to cope with SmartArt: this is the MS marketing name for the feature in OOXML DrawingML that allows separation of data from diagram template.
- A co-operative carve up of RTF's functionality between HTML and ODF 2.0. It seems fairly clear that ODF 1.2 is rather a high bar for many existing suites other than OpenOffice/Symphony to reach, which undermines the substitutability rationale; this may change with more focus and development, and the idea of an ODF profile aimed at supporting broad office documents but being low-hanging fruit for 100% support by existing word processing applications (Word Perfect, KOffice, AbiWord, etc) is enticing. Would such a profile be significantly different from what HTML can provide to justify it, or should
- A real effort to add support for features in IS29500 and UOF etc to ODF 2.0, with an emphasis on graceful degradation.
- More support for richer XML media and specialist content types.
- Carrots and sticks to entice and direct Microsoft down the path towards the Bermuda triangle.
Now these are just for word processing: the spreadsheet and presentation formats also need to be factored in. It seems to me that presentation applications currently sit at an unhappy position between flick-cards (where HTML surely has the legs) and simple animations/interactivity (where Flash has the legs). Even the current idea of ODF (and to a lesser extent of OOXML), that you can express word processing information with simple structures decorated by optional compatibility properties/attributes that allow graceful degradation and partial implementation, is problematic: if that is the case, why not just use HTML with extra CSS properties?
The Real Present and the Real Future
Where does this lead? I suggest that perhaps the looming challenge for document standards is not in deciding or developing perfect formats, but in integrating the packaged world of documents with the fragmented world of web resources. Documents that can be websites. Page description files with external resources. Arbitrarily nestable documents. Web applications that are single files and editable as applications by word processors: in this vision, Open Office or MS Word (or perhaps MS Powerpoint) should be able to open up and edit a .WAR web archive file (or their successors.)
Again, the central requirement is the document packaging mechanism, enhanced to allow unpackaging and web service. The following diagram, though almost uselessly general, lists various components that are entirely common to both websites and XML-in-ZIP compound documents: why are we squabbling in such minute detail on solutions to XML-in-ZIP formats that do nothing to address this tectonic rift between documents and websites? Because we allow our expectations to be dictated by menu bars: something that currently has a Save As.. or Open menu item is real...
It is obviously not an outlandish vision: there is piecemeal support for many parts already. ODF and OOXML's packaging mechanisms each integrate into the MIME media content types system and have many connections with the W3C mini-formats (Dublin Core, MathML, etc.). Word supports blogging using Atom distribution. Some file system managers allow ZIP archives to be treated as if they were file systems. Synchronization and offline behaviour are mainstream because of mobile and PDA systems. There are printers that support JPEG and SVG directly. Some word processors do have publish and save-as-website converters. The WWW's implicit REST approach encourages richer information bundles rather than stateful servers. And the variety of possible future uses of documents does not give archivists an unambiguous single format to adopt: PDF is good for the reasons Jon Bosak mentions, yet a full-fidelity version can be good for editability, and adopting a common lossy "pivot" format (such as ODF) has distinct advantages too.
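To make the packaging point concrete, here is a minimal sketch that builds a toy ODF-style package in memory and then browses it the way a ZIP-aware file manager might, reporting the MIME media type declared for each part. The `mimetype` entry and `META-INF/manifest.xml` follow the ODF packaging convention as I understand it; the document content is a placeholder and the function names are my own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ODT_TYPE = "application/vnd.oasis.opendocument.text"


def build_toy_package() -> bytes:
    """Build a minimal ODF-style ZIP package in memory (placeholder content)."""
    manifest = (
        f'<manifest:manifest xmlns:manifest="{MANIFEST_NS}">'
        f'<manifest:file-entry manifest:full-path="/" manifest:media-type="{ODT_TYPE}"/>'
        '<manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>'
        '</manifest:manifest>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The package media type goes in first, stored without compression.
        zf.writestr(zipfile.ZipInfo("mimetype"), ODT_TYPE)
        zf.writestr("META-INF/manifest.xml", manifest)
        zf.writestr("content.xml", "<doc>placeholder</doc>")
    return buf.getvalue()


def media_types(package: bytes) -> dict:
    """Map each part path listed in the manifest to its declared media type."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    return {
        entry.get(f"{{{MANIFEST_NS}}}full-path"): entry.get(f"{{{MANIFEST_NS}}}media-type")
        for entry in root.findall(f"{{{MANIFEST_NS}}}file-entry")
    }


for path, media_type in media_types(build_toy_package()).items():
    print(path, media_type)
```

Each part already carries the same kind of type label a web server would send in a Content-Type header, which is part of why the gap between package and website looks bridgeable.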
We need to be able to have our cakes and eat them. Formats that don't support plurality (modularity, alternatives) force us to have documents that by meeting one set of constraints fail to meet another. Even if ultimately RTF can be swallowed by ODF (2.0) and HTML in the Bermuda Triangle approach mentioned above, the lack of interoperability between office document applications (or formats) and websites means we have taken a step forward at the same time as taking a step backwards.
Unless governments and other stakeholders can get beyond the narrow view of documents and interoperability as merely being exchanging data from one similar application to another, and move towards the view that documents and web resources need to be end points on the same interoperable spectrum, we are selling ourselves short.
It is here that standards bodies should be more help: but I don't know that they can be unless there is a stronger commitment to supporting each other's visions better. W3C's mission statement is concerned with bringing the web to its full potential, and W3C have traditionally used this to justify shying away from old-fashioned compound file-based issues: the lack of standards for the *SP (JSP, ASP, PHP) class of documents is a symptom of this, and it is notable that much of XML's uptake came because it did take care of practical production issues (i.e. issues pertaining to the document as it existed before being made available as a resource, such as PIs and entities, and after it had been retrieved, such as character encodings). The industry consortia such as ECMA and OASIS are organized around interest groups on particular standards, which makes it easy to fob off discussion of interoperability. And even ISO, where the availability of topic-based working groups with very broad interests should provide a more workable home for this kind of effort, has a strong disinclination to seek out work that involves liaison with other standards groups: satisfying two sets of procedures and fitting in with two sets of deadlines and timetables can be impractical and disenfranchising for volunteers and small-business/academic experts.
I don't have high expectations of leadership from many other groups. FOSS people often are concerned with pioneering individual efforts rather than collective efforts. Content management businesses and web-app businesses may feel threatened, or feel that their concerns are with websites not office documents. DBMS people have often only been interested in documents to the extent that they fit into the DBMS view: a small document is a report, a large document is a database, anything else is an application. And the large vendors or open source projects which straddle both documents and the web, such as Sun, IBM, and Microsoft, are still (I think) focussed on the dubious delights of business as usual and the Bermuda triangle.
The current bad-mouthing campaign against ISO is extremely unhelpful and unproductive in this regard. It encourages a bunker mentality, a Balkanized standards milieu, which may suit us-versus-them marketeers but ultimately ensures that the big issues of interoperability get hidden behind the veneer of trivia (how many ways to say "bold text").
Towards Web-able Document Formats
So what is there that we can do to nudge us in the direction of this large-scale interoperability? Participate in the most congenial standards body to your interests, and encourage them to support plurality (modularity, alternatives, feature harmonization with each other's standards, graceful degradation) and the dissolution of the great website/document divide. And foster good reasons (carrots and sticks) so that the large developers see value in this kind of enabling standard. For developers, make browser plugins that locate resources in an unzipped XML-in-ZIP document delivered from a vanilla web server; or enhance the existing XML-in-ZIP formats to cope with web formats (ODF and OOXML with CSS and SVG Tiny, for example) better: round-trip rather than just convert-and-discard. There is lots to do to unblock the plumbing: for example, get a standard URL format for compound documents, particularly inside ZIP archives, and a standard for ZIP.
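As a rough sketch of what a URL format for parts inside ZIP archives could look like, the code below resolves a compound URL of the form `package-url!/part-name`. The `!/` separator is an assumption borrowed from Java's `jar:` URLs purely for illustration; it is not a web standard, which is exactly the gap. The `fetch` callable, the `make_package` helper and the example.org address are likewise hypothetical stand-ins.

```python
import io
import zipfile

SEPARATOR = "!/"  # assumption: borrowed from Java's jar: URLs, not a web standard


def split_compound_url(url):
    """Split 'outer!/inner/part' into (package URL, part name inside the package)."""
    outer, sep, inner = url.partition(SEPARATOR)
    if not sep or not inner:
        raise ValueError(f"not a compound URL: {url!r}")
    return outer, inner


def resolve(url, fetch):
    """Fetch the package with the caller-supplied `fetch`, then read one part from it."""
    outer, inner = split_compound_url(url)
    with zipfile.ZipFile(io.BytesIO(fetch(outer))) as package:
        return package.read(inner)


def make_package(parts):
    """Hypothetical helper: zip a {name: bytes} mapping into an in-memory package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        for name, data in parts.items():
            package.writestr(name, data)
    return buf.getvalue()


# A dict stands in for a vanilla web server in this sketch.
server = {"http://example.org/report.odt": make_package({"content.xml": b"<doc>hello</doc>"})}
print(resolve("http://example.org/report.odt!/content.xml", server.__getitem__))
```

The same `resolve` function would work against an unzipped copy of the package on a plain web server if `fetch` were replaced by a function that maps the part name onto a path, which is the kind of equivalence a standard would need to pin down.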