The issue of document conformance is of course one of the core and perennial tasks that any standards group deals with. The issue only goes away when the standard dies. And each time there is a new group of stakeholders, the issue may need to be revisited, tweaked or augmented. That is just the nature of the business.
Verifiability
Conformance is hard. ISO standards have a constraint that only "verifiable" statements can be made in normative text: no airy fairy fluff. And I certainly belong to the camp that says that the clauses in IT standards (in particular document standards) should not only be "verifiable" but that they should be objectively and automatically verifiable in standard ways.
In other words, standards should limit themselves, as much as possible, to constraints that can be expressed in schemas, and schema languages therefore need to be smart enough to cope with the kinds of constraints that do in fact crop up with documents. Hence Schematron: indeed, as I have previously written here, it is even possible to write Schematron schemas that can be converted to ISO standard text!
Many a standard has fallen off the tracks by making objective-looking statements that cannot be verified, or by using home-made formal-looking notations that in fact are inadequate for detecting conformance (or which are inadequately specified such that there is no way to detect errors in the notation itself!)
I recently wrote a piece, Conformance classes should mirror stakeholder usage clusters, and have been tracking and commenting on various issues with ODF and OOXML conformance.
It seems to me that there is a piece missing. It applies to ODF, OOXML and other standard formats.
Rather than describe the problem in much more detail, I think it would be better just to give my first stab at a solution. It certainly has constraints which I think are at a higher level than we can expect schema languages to validate.
MODUS - Minimum Open Document Using Standards
A document is a MODUS document when the following constraints are all true:
- Only international standard container format (OPC, JFIF, ODF's)
- Only standard formats used for parts
- Only namespaces and values defined in public documentation
- All data defined by the standard should be available at least in eponymous standard form - alternative standard formats are allowed but no extensions or nonstandard formats
- All metadata defined by the standard should be available at least in eponymous standard form - alternatives possible and extensions allowed (subject to the third constraint above)
- All data and metadata static - calculated values cached
- Data and metadata should be represented in the most direct way available according to the standard, with reasonable leeway
- Appropriate use of accessibility, security and internationalization features
To put this negatively:
- No non-standard formats
- No non-standard parts
- No undocumented or proprietary formats, parts, elements, attributes, values, functions
- No data only available in non-standard format
- No metadata covered by the standard only available in non-standard format
- No dynamic or external data or metadata
- No obfuscated or convoluted code
- No data that unnecessarily creates disabilities, insecurity or disadvantage
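Most of these constraints need human judgement, but a few can be checked mechanically at least in part. As a minimal sketch of the third constraint (only namespaces defined in public documentation), the following Python collects the namespaces used by elements and attributes in one XML part and reports any that are missing from an allow-list. The allow-list contents, the function names and the vendor namespace in the sample are assumptions made for illustration; a real checker would need the full list for the standard in question and would also have to look at values, not just names.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: namespaces assumed to be publicly documented.
# Only two ODF namespaces are listed here to keep the sketch short.
DOCUMENTED_NAMESPACES = {
    "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def namespace_of(name):
    """Return the namespace URI of an ElementTree name like '{uri}local', or ''."""
    if name.startswith("{"):
        return name[1:].split("}", 1)[0]
    return ""


def undocumented_namespaces(xml_text, allowed=DOCUMENTED_NAMESPACES):
    """Return namespaces used by elements or attributes that are not allowed."""
    root = ET.fromstring(xml_text)
    found = set()
    for element in root.iter():
        found.add(namespace_of(element.tag))
        for attribute_name in element.attrib:
            found.add(namespace_of(attribute_name))
    found.discard("")  # names with no namespace are ignored in this sketch
    return found - set(allowed)


# A made-up part that smuggles in a vendor attribute.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:v="http://example.com/vendor">
  <office:body><text:p v:secret="1">Hello</text:p></office:body>
</office:document-content>"""

print(undocumented_namespaces(SAMPLE))  # {'http://example.com/vendor'}
```

An empty result only says that no unknown namespace was used in that part; it says nothing about whether the data is represented reasonably, which is the kind of judgement the "reasonable" test is for.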
MODUS relies on a distinction between data and metadata. For a word processor, the data is the text and styles and media. For a spreadsheet, the data is the numbers and formulae. For a presentation, the data is the slides and sequencing and media. File manager thumbnails? Metadata. Text on a page? Data.
What is "reasonable"? A quasi-legal test, such as what an expert in the field of documents and markup with broad experience would consider reasonable. Enough that gratuitous and deliberate or sloppy conversions would not qualify. For example, a text PDF file converted to ODF by making it a drawing with each line unconnected would not satisfy the 'reasonable representation' test.
What is "eponymous standard form"? This is the particular standard form associated with the standard. So, for example, an ODF document that had a graphic must have the graphic in ODF's dialect of SVG, even if it also provides it in real SVG Tiny and OOXML's DrawingML.
So what kinds of document would conform to MODUS?
- An ODF document that had no extensions of any kind, that had no scripts or macros, no proprietary data formats.
- A KDE ODF document that had extra metadata in its drawings concerning the editing history of the nodes.
- An OOXML document that has customXML wrappers, alternativeContent chunks, alternative sections using OOXML MCE, embedded XBRL and HL7 data files and so on, provided it did not have any undocumented or proprietary or dynamic or external values, extensions, parts, etc.
What documents would not conform?
- An ODF document that contained data in any undocumented proprietary format, or which relied on macros to generate data, or links to external data that had to be retrieved
- An OOXML spreadsheet document which did not provide calculated values but which would rely on an application to implement the particular formula language or libraries in order to read a value
- An ODF document that contained a drawing in ISO CGM vector format but did not also provide it in ODF SVG or ISO PNG or ISO JPG
- An OOXML document where MCE was used to represent data so that it was only available in non-OOXML elements.
- An ODF document where binary information was attached to a drawing in the allowed Bin64 notation which contained data items that the user would reasonably expect to be rendered by even a low-end draft, read-only application.
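The spreadsheet case above is one that lends itself to a mechanical check. As a rough sketch, assuming the usual SpreadsheetML markup in which a cell is a c element, its formula an f child and its cached result a v child, the following Python lists the cells in one worksheet part that carry a formula but no cached value. The function name and the sample worksheet are made up for illustration, and the check ignores complications such as shared formulas or how the part is extracted from the package.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, in ElementTree's '{uri}' notation.
SML = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def formula_cells_without_cached_value(sheet_xml):
    """Return references of cells that have a formula but no cached value."""
    root = ET.fromstring(sheet_xml)
    missing = []
    for cell in root.iter(SML + "c"):
        has_formula = cell.find(SML + "f") is not None
        value = cell.find(SML + "v")
        has_value = value is not None and (value.text or "").strip() != ""
        if has_formula and not has_value:
            missing.append(cell.get("r", "?"))
    return missing


# A made-up worksheet: C1 has a formula but no cached result.
SHEET = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>2</v></c>
      <c r="B1"><f>A1*2</f><v>4</v></c>
      <c r="C1"><f>A1+B1</f></c>
    </row>
  </sheetData>
</worksheet>"""

print(formula_cells_without_cached_value(SHEET))  # ['C1']
```

A worksheet for which this returns an empty list can be read by an application that does no calculation at all, which is the point of the "calculated values cached" constraint.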
So MODUS is aimed squarely at interoperability, but not at a simplistic kind that limits what alternative forms are allowed, or which metadata extensions are allowed. It says that all the data must be available at least in standard eponymous form: in the family.
A test would be that if there was some reference implementation of an application that strictly limited itself to displaying and editing the standard only, and which did no calculation or scripted behaviour, it could open the MODUS document and provide the user with all the information contained in the document that was defined by the standard: no data would be unavailable, no metadata defined by the standard would be unavailable (or only available in some non-standard form.)
I think this is the kind of conformance and profile that users interested in interoperability, application substitutability, and open formats for public information need.
I think we need something like MODUS because we need to support the maximum richness and adoptability of the standard formats without sacrificing openness and interoperability.