Packaging formats of famous application/*+zip

By Rick Jelliffe
January 9, 2009 | Comments: 9

Here is a little table showing some of the characteristics of the various packaging formats used by modern XML-in-ZIP applications.

Packaging relates to the particular details of how files or resources are arranged in the ZIP archive.

The granddaddy of these packaging systems is the JAR system. This came from the Java world: it takes a ZIP file and adds a META-INF directory. (IIRC, ZIP is actually a flat container with no directories as such: systems conventionally interpret "/" in names as a directory separator, for convenience, so we can talk as if they are directories even though they may really just be part-name prefixes.)
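
To make the "flat container" point concrete, here is a minimal sketch using Python's standard zipfile module. The part names are made up for illustration. It writes two entries and then lists the names exactly as stored: no entry exists for the "directory" itself, only a name that happens to contain "/". (Some ZIP tools do write explicit entries whose names end in "/", so this shows the convention rather than a rule.)

```python
import io
import zipfile


def build_demo_archive():
    """Build a tiny in-memory ZIP; the part names are made up for illustration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        zf.writestr("content.xml", "<doc/>")
    return buf.getvalue()


def list_part_names(archive_bytes):
    """Return the raw entry names exactly as stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.namelist()


names = list_part_names(build_demo_archive())
# Nothing was written for "META-INF/" itself: the "directory" is only a name prefix.
prefixes = sorted({name.rsplit("/", 1)[0] for name in names if "/" in name})
print(names)
print(prefixes)
```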

In JAR, this META-INF directory contains various files, notably MANIFEST.MF, a manifest file that contains various kinds of useful data about the files in the archive.

There are at least four main streams of packaging systems:

  • IMS is a long-standing education and training stream, used by SCORM (learning objects) and perhaps S1000D (aerospace technical documents)
  • OPC is used by Microsoft for OOXML and XPS
  • ODF, which was missing key pieces in ODF 1.0 and 1.1, but which looks like getting key pieces for ODF 1.2
  • and a stream I think of as "fake ODF": this is a stream which includes EBooks OCF and more recently Adobe's UCF, but which seems to have adopted some container mechanism from a draft of ODF that was not eventually adopted. (An odd situation where these specifications claim to use a namespace URI urn:oasis:names:tc:opendocument:xmlns:container that OASIS does not seem to actually specify. It needs to be cleared up: you don't adopt draft namespaces into something and then claim they are somehow standard.)

There seems to be growing convergence. Adobe is pushing that there should be an OASIS group for packaging, which would presumably merge UCF and ODF 1.2 packaging. In ODF 1.2, the packaging forms a separate part (i.e. document), so it looks like things are set up well for this. It would be a good idea. 

Other areas of increasing agreement are on only supporting deflate compression, on using Dublin Core metadata, on using W3C Digital Signatures, on using W3C Encryption, and on using RDF for other metadata (or, at least, on providing a clear transformation from the specific markup to RDF.)
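
As a rough illustration of what "deflate only" means for a consumer, the sketch below lists any entries that use some other ZIP compression method. It assumes that uncompressed ("stored") entries are also acceptable, which is my reading rather than something stated above; the function name and demo part names are hypothetical.

```python
import io
import zipfile

# Assumption: "stored" (no compression) is accepted alongside deflate.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}


def parts_outside_profile(archive_bytes):
    """Return names of entries whose compression method is neither stored nor deflate."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [info.filename for info in zf.infolist()
                if info.compress_type not in ALLOWED_METHODS]


def build_demo_archive():
    """Hypothetical package with one deflated part and one bzip2 part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc/>" * 50, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("big.xml", "<big/>" * 50, compress_type=zipfile.ZIP_BZIP2)
    return buf.getvalue()


offenders = parts_outside_profile(build_demo_archive())
print(offenders)
```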

Some of the areas of disagreement relate to the different use-cases for the packaging. In particular, there is the issue of whether the package is supposed to hold just a single document or publication, or whether it should hold multiple documents (or multiple publications or information packages or websites).

The major differentiator between the different packaging mechanisms is whether they provide a system of indirection for identifying parts by short names. The most immediate aspect of this is whether the root file (i.e. the file where the main data of the document is kept) is hardcoded (such as ODF's /content.xml) or whether it must be looked up in some other file. OOXML has an advanced system, the relationships files, which provide a mapping from an identifier to a filename or external URI, rather like an SGML/XML entity set.
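
The difference between a hardcoded root and a looked-up root can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the order in which the mechanisms are tried are my own, the OPC lookup only recognises the OOXML "officeDocument" relationship type (XPS uses a different one), and the demo archives are skeletal stand-ins rather than valid documents.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OPC_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OPC_OFFICE_DOC = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                  "relationships/officeDocument")
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def find_root_part(zf):
    """Return the name of the root part, or None if no known mechanism applies."""
    names = set(zf.namelist())
    if "_rels/.rels" in names:  # OPC: indirection through a relationship
        rels = ET.fromstring(zf.read("_rels/.rels"))
        for rel in rels.findall(f"{{{OPC_RELS_NS}}}Relationship"):
            if rel.get("Type") == OPC_OFFICE_DOC:
                return rel.get("Target").lstrip("/")
    if "META-INF/container.xml" in names:  # OCF/UCF: indirection through container.xml
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(
            f"{{{CONTAINER_NS}}}rootfiles/{{{CONTAINER_NS}}}rootfile")
        if rootfile is not None:
            return rootfile.get("full-path")
    if "content.xml" in names:  # ODF: hardcoded name
        return "content.xml"
    return None


def make_zip(parts):
    """Build an in-memory archive from a {name: text} mapping and reopen it for reading."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return zipfile.ZipFile(io.BytesIO(buf.getvalue()))


opc = make_zip({
    "_rels/.rels": f'<Relationships xmlns="{OPC_RELS_NS}"><Relationship Id="rId1" '
                   f'Type="{OPC_OFFICE_DOC}" Target="word/document.xml"/></Relationships>',
    "word/document.xml": "<w/>",
})
ocf = make_zip({
    "META-INF/container.xml": f'<container version="1.0" xmlns="{CONTAINER_NS}"><rootfiles>'
                              '<rootfile full-path="OEBPS/book.opf" '
                              'media-type="application/oebps-package+xml"/>'
                              '</rootfiles></container>',
})
odf = make_zip({"content.xml": "<x/>"})

print(find_root_part(opc), find_root_part(ocf), find_root_part(odf))
```

The hardcoded case needs no parsing at all, which is its attraction; the indirect cases cost an extra XML read but let the same archive name its root freely.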


 

| Feature | (ODF) | OPC (Open Packaging Convention) | OCF (OEBPS Container Format) | UCF (Universal Container Format) | PIF (Package Interchange Format) | S1000D-in-SCORM (in use?) |
|---|---|---|---|---|---|---|
| Uses | ODF (OpenDocument Format) | OOXML, XPS | OEBPS (Open E-Books) | INX, MARS | IMS, SCORM | S1000D |
| Source | ISO, OASIS, Sun | ISO, ECMA, Microsoft | | Adobe | Educational | Technical documentation |
| Schema | RELAX NG | XSD, RELAX NG | RELAX NG | RELAX NG | XSD | XSD |
| Media types | /mimetype as first part of archive | /[Content_Types].xml | /mimetype as first part of archive | /mimetype as first part of archive | | |
| Path to root file | Hardcoded to /content.xml | Located by relationship in /_rels/.rels | /META-INF/container.xml from fake ODF* (required) | /META-INF/container.xml from fake ODF* (optional) | Hardcoded. Manifest acts as root | |
| Manifest | /META-INF/manifest.xml. In ODF 1.2 updated to use RDF | /_rels/.rels | /META-INF/manifest.xml from ODF | /META-INF/manifest.xml from ODF | /imsmanifest.xml | Publication module |
| Signatures | | Located by relationship, using W3C XML DSIG | /META-INF/signatures.xml from W3C XML DSIG | /META-INF/signatures.xml, undefined | | |
| Part encryption | SHA1/Blowfish CFB/PBKDF2 | | /META-INF/encryption.xml from W3C XML ENC | /META-INF/encryption.xml, undefined | | |
| Rights | | | /META-INF/rights.xml, format not specified | /META-INF/rights.xml, format not specified | | In publication module |
| Metadata | /metadata.rdf in ODF 1.2 | Core properties with Dublin Core, located by relationship | | META-INF/metadata.xml with INX using Adobe XMP. XMP uses RDF | (in manifest) | In publication module. Recommends mapping to Dublin Core and RDF for output |
| Resource locators | | Relationships system | | | In manifest, then linked-to sub-manifests | Module codes. These are structured names and so act like directories. Used in entity references. Can use XPaths into resources |
| Thumbnail | /Thumbnails/thumbnail.png but not in manifest | Located by relationship | X | X | | |
| ZIP allows encryption | Yes. SHA1 & RFC2898 | No | No | No | | |
| ZIP compression | Deflate only | Deflate only | | Deflate only | Deflate only | |
| Multiple publication/document | No | No | Yes | Yes. Top-level application or document directories | Yes. Resources are active objects and can have parameters passed at invocation time | Yes. Publication modules reference data modules. |
| Interleaving allowed | No | Yes | No | | | |
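
The "/mimetype as first part of archive" convention in the table can be sketched as follows. This is a minimal illustration, not any specification's reference procedure: the function names are hypothetical, and the sketch assumes the mimetype entry is written first and stored uncompressed, which is what lets simple tools find the media type near the start of the file. Note that zipfile reports entries in central-directory order, which normally matches write order but is not guaranteed to match the physical order.

```python
import io
import zipfile


def write_package(media_type, parts):
    """Write a package whose first entry is an uncompressed 'mimetype' part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Stored (not deflated) so the media type stays readable as plain bytes.
        zf.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def sniff_media_type(archive_bytes):
    """Return the declared media type, or None if 'mimetype' is not the first entry."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        infos = zf.infolist()
        if not infos or infos[0].filename != "mimetype":
            return None
        return zf.read(infos[0]).decode("ascii").strip()


package = write_package("application/vnd.oasis.opendocument.text",
                        {"content.xml": "<doc/>"})
declared = sniff_media_type(package)
print(declared)
```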

I think there is good scope for useful convergence of the various standards. The key would be to piggyback it on some mutually attractive feature, such as better support for multi-document packages and websites. I have mentioned this aspect before, when talking about whether a file can be ODF and OOXML and a website at the same time. Can it be ODF and XPS and MARS at the same time?

My expectation for convergence would be that there would be a level of convergence where everyone agrees on ZIP (deflate), self-identification of document type, multiple document support, /mimetype, W3C DSIG, Dublin Core metadata and IS29500 OPC's URL scheme for identifying parts, but then an advanced layer with more platform-dependent features on things like references, relationships, RDF and rights where one vendor's meat may be another's poison--encryption & DRM may certainly be contentious. (The OEBPS system already has such a split model, with the OPF layer sitting on top of OCF to provide references, if my brief reading is correct.) The goal should include making sure the same archive can support a plurality of these different platforms without one locking out the other. (Which brings up consistency-guarantee issues, of course.)

Some of the incompatibilities are easier to resolve than they might seem. For example, the fact that ODF requires a thumbnail file in a single format, name and location, while OOXML allows multiple thumbnails in any format, with name and location independence, is not actually an incompatibility. It just means that an OPC system should generate at least a PNG with that name and location.

Where does W3C's Compound Document Format fit in? The W3C pretty much starts off with the assumption that all resources are available separately on the WWW at separate URLs, and therefore packaging is unnecessary, if not quaint. So it is not surprising if they are not fussed with packaging mechanisms. (Of course, it is utter rubbish: what is important is that URI fragment references can be made for information at all grains, and these ZIP based packaging mechanisms have a scope and use and convenience far beyond what mime-multipart provides.)

But I think CDF may have a really useful role, because while I think that packaging is ripe for convergence of the basics and support for plurality for the platform-dependent issues, I think at the document level all the standard formats need to look into providing alternative chunks in simplistic W3C formats. OOXML already does this quite a bit in its alternative chunk mechanism. But it is not only OOXML that would benefit: it would be just as useful for ODF, allowing applications to provide real SVG-tiny as well as the fake-SVG that ODF uses (SVG element names in an OASIS namespace with ODF extensions and restrictions.)

To me, this is the only feasible route to format convergence: getting agreement on what almost everyone already supports (the low-hanging fruit), neutralizing any gratuitous limitations where there are legitimate areas of difference (extensibility), and supporting alternatives as a practical mechanism for allowing market/bazaar forces to determine the viability of different vocabularies and subformats (plurality.)

[Update: And, as with S1000D's treatment of RDF, I expect many bunfights about the one true format will be alleviated merely by providing standard mappings. (ODF could do this with their SVG dialect, for example.) If there is a pretty complete spec and running FOSS code available for going from my vocabulary of choice to your vocabulary of choice, there is less need for anyone to be doctrinaire. You say potato, I say potato. Take this to its logical conclusion, and mechanize it, and you end up nearer Ken Krechmer's Adaptability Standards.

But this isn't standards convergence dictated by wise men with grey beards, or by benevolent public servants, or good citizen corporations. While there is work involved, and agendas pushed, this is emergence rather than convergence: it is the organic plurality of the bazaar/market or the garden. ]



9 Comments

Another format saved this way is the one used in Apple's iWork '09. XML in ZIP as well. Too bad Apple provides extremely brief info on it...

Daniel: I went to the Apple Developer site developer.apple.com and did a search for "iworks" among other things. Result: not a single page of documentation.

It could be that their format is so transparent and luminous that it does not need documentation: we would have to get a file, open it up as zip and inspect it. We might find they re-use existing standards to such an extent that there is no need for independent documentation.

Or it could be, I suppose, that they are endorsing the standard formats like ODF and OOXML. If you want interoperability, use those. They are saying "don't use this for interoperability".

I don't think that iWorks or Apple fit into the class of "market dominator", so I don't expect that they have a special duty to disclose for free-market reasons, so they would just have the ordinary requirement that without the documentation, outsiders cannot use it. We are only in '09 now, and still in holiday time for many people, so it would not be fair to read too much into it at this stage.


Thank you for this very interesting article.

I'm always disturbed by all those ZIP formats because, dynamically, they are more difficult to generate than pure XML where XSLT can be good enough. That's why I have already started to develop a generic tool able to convert from XML to ZIP and inversely.

Such XML structure should also be standardized and it would be nice if it could always be an alternative.

Alain: The initial XML formats of the office systems tried that (e.g. Word 2003) and a single file was not successful.

There are several reasons: first, because encoding images and other binaries in base64 inside the XML file blows out the file size and conversion time (in ZIP, you just put the image without recoding); second, because of XML WF, the whole document has to be read serially even if you only want some portion (in ZIP, with multiple smaller parts, chances are that you can just load the part that you want); and third, for robustness, because one WF error in the XML transmission causes the whole document to be unreadable (in the ZIP, if there is an error, only that part is lost.) There are probably other issues too, such as modularity.

What needs to happen is that the XML infrastructure needs to handle ZIP better. 99% of this is just to support a URI scheme which allows location inside a ZIP archive: IS29500 defines such a scheme, which is compatible with the RFC's syntax (which Sun's jar: or zip: schemes are not, but I think they should be supported too.)

For example, an XSLT engine should be able to be invoked on a document that is a part of a ZIP archive using the URL, and its entity resolver should be able to cope with URLs that are relative to the ZIP archive as well as relative to the current part of the ZIP archive.
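
As a rough sketch of the resolution behaviour described here, the following uses ordinary relative-reference resolution from Python's urllib.parse, treating part names as absolute paths within the archive. This is not the IS29500 URI scheme itself, just generic path resolution under that assumption; the function name and part names are hypothetical.

```python
from urllib.parse import urljoin


def resolve_part_reference(current_part, reference):
    """Resolve a reference found inside one part to an archive-relative part name.

    References starting with "/" are taken relative to the archive root;
    anything else is taken relative to the part that contains the reference.
    """
    return urljoin(current_part, reference)


sibling = resolve_part_reference("/word/document.xml", "media/image1.png")
from_root = resolve_part_reference("/word/document.xml", "/docProps/core.xml")
parent = resolve_part_reference("/word/document.xml", "../customXml/item1.xml")
print(sibling, from_root, parent)
```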

The other 1% is that schema languages need to support multiple documents better. In Schematron, for example, I am proposing an enhancement specifically to allow this: to allow
...

This article is discussed at the Universal Interoperability Council's web site.

Rick,

As a side note, we did a very similar thing back in 1995 (released in 1996) with our "Portable Report Format" (PRF) file. This was a zip package with parts, albeit proprietary. Not XML (er, obviously), but another file format that lends itself to compression: sparse text-based reports.

We amused ourselves over the years as more and more formats adopted the zip/parts approach.

http://news.cnet.com/Datawatch-dispatches-file-system/2100-1023_3-215903.html

Still, it did have an amusing interop story of its own, since Microsoft stomped all over the .PRF extension, meaning file associations got messed up by at least 2 different Microsoft file formats: PICS rules files and Outlook Profile files both hijacked the extension.

http://support.microsoft.com/kb/236892

http://support.microsoft.com/kb/308300

Not to mention ClarisWorks, Macromedia Director, etc etc.

We now mainly use PDF instead to conglomerate, compress, protect and distribute these files with bookmarks replacing the tree index representation we used in our client. Now there's a standard where you'd be hard pressed to find one valid instance in the wild outside Adobe produced files, and they're not always kosher either. Don't get me started.

Gareth

Rick, you say there seems to be growing convergence. Is some standards body or other promoting that convergence? Is any one vendor or organization pushing for convergence? What watering hole should I hang out in to meet those interested? :O)

Bruce: OASIS. I believe ISO/IEC JTC1 SC34 has a preference for vetting standards made by industry initiatives such as consortia, rather than being the prime developer: it is a workable forum so that different competing consortia each get an equal opportunity, if they want to play along in good faith.

The thing that will stop convergence is the lack of participation in standards efforts by significant stakeholders on the users side. USPTO for example!

All the big vendors have a strong interest in adopting an XML-in-ZIP format, just because it is a now obvious sweet spot. But they have no interest in getting their details harmonized, and a lot of mutual suspicion.

I believe Adobe has been making noises about an OASIS forum for their UCF. And ODF 1.2 also has been split into two parts, with packaging in one part IIRC, and I understand that the packaging part has not been receiving much attention. Because ODF is OASIS too, I would hope that any UCF effort and any ODF effort would be carried out by the same group, and it would be great if that group had members who were genuinely interested in having a converged standard.

Complete convergence is not always necessary or feasible of course, particularly for legacy. But for making a new agreement, I think people will expect at least an intent to reduce genuine differences.

What would be great would be an OASIS standard that could cope with ODF, Adobe UCF, and OOXML's OPC (e.g. OPC without the relationships) but which could, if there was any point, become an ISO standard that was invoked by both the ODF and OPC standards.
