A Sketch on Modeling Dialects of XML File Formats

By Rick Jelliffe
April 20, 2010

This is a follow-up to my blog post "Supporting Degradation: towards a workable Open Packaging standard".

If we look at schema languages and how to declare and manage them, we can come up with perhaps five layers of specification:

  • Specifying that some data has the correct value, e.g. composed of certain characters
  • Specifying that some markup is used correctly, e.g. composed of certain elements
  • Specifying that a document has what we expect it to have, e.g. composed of certain namespaces
  • Specifying that some document set has what we expect, e.g. links to data of the right type or composed of certain kinds of files
  • Specifying, in the case of so-called multivalent or composite documents, that the alternatives form a complete version of the document, e.g. where several alternative formats are given for formulae, that there are both JPEG and MathML versions of every formula, according to some edition of some profile or specification.

However, the further we go down this list, the less capable our schema languages are of expressing what we need. We can use XSD Datatypes or ISO Character Repertoires for the data specification, RELAX NG or XSD Structures for the (logical) markup checking, and NVDL for namespace checking, but by the time we get to the last two layers, support from schema languages is getting pretty thin. Schematron does have some capabilities for link checking.
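The last two layers can at least be approximated in ordinary code. The following is a minimal Python sketch, not a feature of any existing schema language: the Part class, the function names and the sample file names are all hypothetical, chosen only to illustrate a link check over a document set and a completeness check over alternative formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One file in a document set (hypothetical structure)."""
    name: str
    media_type: str


def broken_links(parts, links):
    """Layer 4: each link is (source, target, expected media type).

    Returns the links whose target is missing or has the wrong type.
    """
    by_name = {p.name: p for p in parts}
    bad = []
    for source, target, expected in links:
        part = by_name.get(target)
        if part is None or part.media_type != expected:
            bad.append((source, target, expected))
    return bad


def incomplete_alternatives(alternatives, required):
    """Layer 5: alternatives maps an item id (e.g. a formula) to the set of
    media types supplied for it.

    Returns, per item, the required media types that are missing.
    """
    missing = {}
    for item, supplied in alternatives.items():
        gap = set(required) - set(supplied)
        if gap:
            missing[item] = gap
    return missing


# Hypothetical sample data: formula 2 has a JPEG but no MathML version.
parts = [
    Part("document.xml", "application/xml"),
    Part("f1.jpg", "image/jpeg"),
    Part("f1.mml", "application/mathml+xml"),
    Part("f2.jpg", "image/jpeg"),
]
links = [
    ("document.xml", "f1.jpg", "image/jpeg"),
    ("document.xml", "f2.mml", "application/mathml+xml"),  # target absent
]
alternatives = {
    "formula1": {"image/jpeg", "application/mathml+xml"},
    "formula2": {"image/jpeg"},
}
required = {"image/jpeg", "application/mathml+xml"}

print(broken_links(parts, links))
print(incomplete_alternatives(alternatives, required))
```

The point of the sketch is only that these checks are easy to state procedurally, yet have no obvious home in the schema languages listed above.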

So here is a thought experiment, which is a model of OOXML and its dialects. I could have chosen ODF or HTML. When we talk about OOXML, sometimes we mean one of its formal specifications, sometimes we mean a file format, and sometimes we mean what an application can generate or consume. This ambiguity allows a tremendous amount of loose and futile talk about it.

A further consideration is that applications frequently follow Postel's law, and so the dialects they implement can be best approximated as some kind of union of other dialects: I use the + character below to denote such a union.

<dialect-model>

<name>Open XML</name>

<editions>
<edition id="e1">
<name href="....pdf">ECMA 376 (1st ed)</name>
</edition>

<edition id="i1" supercedes="e1">
<name href="....pdf">IS 29500:2008</name>
</edition>

<edition id="i2" supercedes="i1">
<name href="....pdf">IS 29500:2008 w corrections</name>
<name href="....pdf">ECMA 376 (2nd)</name>
</edition>

<edition id="i3" supercedes="i2">
<name href="....pdf">IS 29500:2011</name>
<name href="....pdf">ECMA 376 (3rd)</name>
</edition>

<edition id="cjk1">
<name href="....pdf">IS 29500:2012 Part 5 CJK Extensions</name>
</edition>
</editions>

<packages>
<package id="wp">
<name>Open XML Word Processing Package</name>
<extension>DOCX</extension>
<mimetype>application/openxml-wordprocessingml+xml</mimetype><!-- ?? -->

<dialect id="ecma-wp" edition="e1" >
<name>ECMA</name>
</dialect>
<dialect id="it1-wp" edition="i1 i2 i3">
<name>ISO Transitional</name>
</dialect>
<dialect id="is1-wp" edition="i1 i2">
<name>ISO Strict</name>
</dialect>

<dialect id="is2-wp" edition="i3">
<name>ISO Strict (new namespace)</name>
<map from="is1-wp" href="Ooxml2008To2010.dsrl" />
<schema href="Ooxml2010wp.nvdl" />
<indicator ns="http://purl.oclc.org/ooxml/wordprocessingml/*" />
</dialect>
</package>

<package id="sp">
<name>Open XML Spreadsheet Package</name>

<extension>XLSX</extension>

<dialect id="ecma-sp" edition="e1" >
<name>ECMA</name>
</dialect>
<dialect id="it1-sp" edition="i1 i2 i3">
<name>ISO Transitional</name>
</dialect>
<dialect id="is1-sp" edition="i1 i2">
<name>ISO Strict</name>
</dialect>
<dialect id="is2-sp" edition="i3">
<name>ISO Strict (new namespace)</name>
<map from="is1-sp" href="Ooxml2008To2010.dsrl" />
<schema href="Ooxml2010sp.nvdl" />
<indicator ns="http://purl.oclc.org/ooxml/spreadsheetml/*" />
</dialect>
</package>


<package id="ps">
<name>Open XML Presentation Package</name>

<extension>PPTX</extension>

<dialect id="ecma-ps" edition="e1" >
<name>ECMA</name>
</dialect>
<dialect id="it1-ps" edition="i1 i2 i3">
<name>ISO Transitional</name>
</dialect>
<dialect id="is1-ps" edition="i1 i2">
<name>ISO Strict</name>
</dialect>
<dialect id="is2-ps" edition="i3">
<name>ISO Strict (new namespace)</name>
<map from="is1-ps" href="Ooxml2008To2010.dsrl" />
<schema href="Ooxml2010ps.nvdl" />
<indicator ns="http://purl.oclc.org/ooxml/presentationml/*" />
</dialect>

</package>

<extension>
<dialect id="cjk" use="is2-wp it1-wp" edition="cjk1">
<name>ISO CJK Extensions</name>
<indicator ns="http://purl.oclc.org/ooxml/wordprocessingml/extension/cjk/*" />
</dialect>
</extension>
</packages>


<applications>
<application>
<name>Office 2007</name>
<generate dialect="ecma-wp" />
<generate dialect="ecma-sp" />
<generate dialect="ecma-ps" />
<consume dialect="ecma-wp" />
<consume dialect="ecma-sp" />
<consume dialect="ecma-ps" />

</application>

<application>
<name>Office 2007 SP2</name>
<consume dialect="ecma-wp + it1-wp" />
<consume dialect="ecma-sp + it1-sp" />
<consume dialect="ecma-ps + it1-ps" />
<generate dialect="it1-wp" />
<generate dialect="it1-sp" />
<generate dialect="it1-ps" />
</application>

<application>
<name>Office 2010</name>
<consume dialect="ecma-wp + it1-wp + it2-wp" /> <!-- + is union -->
<consume dialect="ecma-sp + it1-sp + it2-sp" />
<consume dialect="ecma-ps + it1-ps + it2-ps" />
<consume dialect="is1-wp" />
<consume dialect="is1-sp" />
<consume dialect="is1-ps" />
<generate dialect="it2-wp" />
<generate dialect="it2-sp" />
<generate dialect="it2-ps" />
</application>


<application>
<name>Office 2010 SP1 ???</name>
<consume dialect="ecma-wp + it1-wp + it2-wp + cjk" />
<consume dialect="ecma-sp + it1-sp + it2-sp" />
<consume dialect="ecma-ps + it1-ps + it2-ps" />
<consume dialect="is1-wp + cjk" />
<consume dialect="is1-sp" />
<consume dialect="is1-ps" />
<generate dialect="it2-wp + cjk" />
<generate dialect="it2-sp" />
<generate dialect="it2-ps" />
<generate dialect="is1-wp + cjk" status="deprecate" />
<generate dialect="is1-sp" status="deprecate" />
<generate dialect="is1-ps" status="deprecate" />
</application>

</applications>

</dialect-model>
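
To illustrate how a tool might query a model like this, here is a minimal Python sketch. It embeds a small, well-formed excerpt of the applications section above and treats every consume or generate entry of an application as contributing to one union of dialect ids. That reading of multiple entries is my simplifying assumption, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Excerpt of the <applications> section of the sketch (word processing only).
MODEL = """
<applications>
  <application>
    <name>Office 2007</name>
    <consume dialect="ecma-wp" />
    <generate dialect="ecma-wp" />
  </application>
  <application>
    <name>Office 2007 SP2</name>
    <consume dialect="ecma-wp + it1-wp" />
    <generate dialect="it1-wp" />
  </application>
</applications>
"""


def dialect_set(expr):
    """Split a '+'-separated union expression into a set of dialect ids."""
    return {token.strip() for token in expr.split("+") if token.strip()}


def capabilities(root, verb):
    """Map application name -> union of dialects it can `verb`.

    `verb` is the element name, either "consume" or "generate".
    """
    result = {}
    for app in root.iter("application"):
        ids = set()
        for element in app.findall(verb):
            ids |= dialect_set(element.get("dialect", ""))
        result[app.findtext("name")] = ids
    return result


def can_exchange(root, producer, consumer):
    """True if everything `producer` generates is consumed by `consumer`."""
    generated = capabilities(root, "generate")[producer]
    consumed = capabilities(root, "consume")[consumer]
    return generated <= consumed


root = ET.fromstring(MODEL)
print(can_exchange(root, "Office 2007", "Office 2007 SP2"))  # True
print(can_exchange(root, "Office 2007 SP2", "Office 2007"))  # False
```

The asymmetry in the two results is the Postel's law effect in miniature: the later application accepts a union of dialects but generates only one of them.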



One point I hope this exercise shakes out is that the existing schema languages, whether XSD or the DSDL family, cannot adequately model the kinds of relationships that actually exist in long-lived, big-gun consumer formats like OOXML, ODF and HTML.

Sure, XSD type derivation does have some capabilities for modeling derived types, and ISO DSRL does allow one dialect to be defined in terms of another, but there is no serious way to model dialects qua dialects that can potentially work at all of the levels I suggested in the bullet list above.
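
As one illustration of a dialect model operating at the namespace level, the indicator elements in the sketch could be matched against the namespaces a document actually uses. The Python below is a hedged sketch of that idea. The patterns are copied from the model, but treating the trailing * as a shell-style wildcard, and the function names, are my assumptions.

```python
from fnmatch import fnmatchcase
import xml.etree.ElementTree as ET

# Dialect id -> namespace pattern, copied from the <indicator> elements.
INDICATORS = {
    "is2-wp": "http://purl.oclc.org/ooxml/wordprocessingml/*",
    "is2-sp": "http://purl.oclc.org/ooxml/spreadsheetml/*",
    "is2-ps": "http://purl.oclc.org/ooxml/presentationml/*",
    "cjk": "http://purl.oclc.org/ooxml/wordprocessingml/extension/cjk/*",
}


def namespaces_used(xml_text):
    """Collect the namespace URIs of all elements in a document."""
    found = set()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree writes qualified names as "{uri}local".
        if element.tag.startswith("{"):
            found.add(element.tag[1:].split("}", 1)[0])
    return found


def indicated_dialects(xml_text, indicators=INDICATORS):
    """Return the ids of dialects whose indicator matches some namespace."""
    uris = namespaces_used(xml_text)
    return {
        dialect
        for dialect, pattern in indicators.items()
        if any(fnmatchcase(uri, pattern) for uri in uris)
    }


STRICT_DOC = (
    '<w:document xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main">'
    "<w:body/></w:document>"
)
print(indicated_dialects(STRICT_DOC))  # {'is2-wp'}
```

Note that, under this wildcard reading, a document using the CJK extension namespace would match both the cjk and is2-wp indicators, since the first pattern is a prefix of the second. A real model would need to say whether that is intended.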

Having a model in XML provides the framework to allow further metadata to be registered.

If we wanted to use DSRL or XSLT to make transforms between official and unofficial dialects, we currently have no framework to fit them into. (The dialect model above is a natural place to register such transforms, to be used by Open/Save dialogs.) The dialect model may seem less pellucid to readers than it does to me, but without such a model our schemas do not capture some of the most useful and important aspects of a document, namely those needed for interoperability.
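
As a rough sketch of how an Open/Save dialog might use registered transforms, the Python below searches for a chain of transforms leading from one dialect to another. The first table entry comes from the map element in the model above. The second entry, including its file name, is purely hypothetical and is there only to show chaining.

```python
from collections import deque

# (from dialect, to dialect, transform href)
TRANSFORMS = [
    ("is1-wp", "is2-wp", "Ooxml2008To2010.dsrl"),  # from the model's <map>
    ("ecma-wp", "is1-wp", "EcmaToStrict.xslt"),    # hypothetical entry
]


def transform_chain(src, dst, transforms=TRANSFORMS):
    """Breadth-first search for the shortest chain of transforms.

    Returns the list of transform hrefs to apply in order, an empty list
    if src and dst are the same dialect, or None if no chain exists.
    """
    edges = {}
    for source, target, href in transforms:
        edges.setdefault(source, []).append((target, href))

    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for target, href in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [href]))
    return None


print(transform_chain("ecma-wp", "is2-wp"))
# ['EcmaToStrict.xslt', 'Ooxml2008To2010.dsrl']
print(transform_chain("is2-wp", "ecma-wp"))
# None
```

Transforms are treated as one-way here. Whether a given DSRL or XSLT mapping can be reversed is something the model itself would have to record.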


