Simplistic conversion from word processor formats to plain text is unsafe


By Rick Jelliffe
August 16, 2009

OOXML, ODF and HTML share a common design feature: if you strip out all the tags, you get a plain text file containing the content of the document.

This is a nifty feature, and it makes some operations, such as text searches, much easier.

But it is intrinsically unsafe. There are a number of ways in which text can be introduced, changed or lost, though each format will have a different mix of possibilities. Here are some:

  • Hide: text formatted the same color as the background will appear in the plain text, even though a reader of the rendered document never sees it
  • Graphics: text rendered into a graphic will disappear from the plain text
  • Spoof: many characters have near-duplicates in Unicode (mathematical characters and half-width characters, for example). These may not display in a particular font, or may be stripped out on document save if a less-than-Unicode character encoding is used.
  • Revisions and conditionals: revision text and conditional text are usually represented in just the same way as normal text but with special markup. They may appear when the XML markup is stripped.
  • Extensions: OOXML has a feature, Markup Compatibility and Extensions (MCE), that allows multiple alternative versions of text in different XML vocabularies. Stripping out the markup will cause all the supposed alternatives to be retained. ODF allows arbitrary extension elements, and the same thing can happen there.
  • Transcluded text: XForms and customXML allow text to be included from an outside source. This text would disappear from the plain text view.
  • Entities: at a pinch, it is conceivable that the entity mechanism could cause problems, if the conversion were made by a non-parsing string processing mechanism that just stripped tags rather than working on an XML infoset.
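
To make two of these cases concrete, here is a minimal Python sketch of the revisions and entities problems. The fragment is a simplified, hand-written imitation of WordprocessingML (required attributes on the revision elements are omitted), and the function names `naive_strip` and `visible_text` are made up for this illustration. The second function implements just one possible policy, "accept all tracked changes", and is not a complete converter.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified, hypothetical fragment: one tracked deletion, one tracked insertion.
SAMPLE = (
    '<w:p xmlns:w="' + W + '">'
    "<w:r><w:t>Pay Smith &amp; Co </w:t></w:r>"
    "<w:del><w:r><w:delText>1000</w:delText></w:r></w:del>"
    "<w:ins><w:r><w:t>9000</w:t></w:r></w:ins>"
    "<w:r><w:t> dollars.</w:t></w:r>"
    "</w:p>"
)


def naive_strip(xml_text):
    """Remove anything that looks like a tag, without parsing the XML."""
    return re.sub(r"<[^>]+>", "", xml_text)


def visible_text(xml_text):
    """Parse the XML and keep only w:t runs, so w:delText is left out."""
    root = ET.fromstring(xml_text)
    return "".join(t.text or "" for t in root.iter("{" + W + "}t"))


print(naive_strip(SAMPLE))   # Pay Smith &amp; Co 10009000 dollars.
print(visible_text(SAMPLE))  # Pay Smith & Co 9000 dollars.
```

The tag-stripped version merges the deleted and the inserted amounts into one number and leaves the entity reference unresolved; the parsed version does neither, but only because it knows which elements carry deleted text.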

So what is the solution?

  • Document conversion from word processor formats needs to be aware of the basic features of the format with respect to text assembly: extensions, MCE, revisions, conditionals, transclusions, entities and so on.
  • Document conversion of sensitive documents may need to be checked by eye, or checked from within an application that understands enough of the format to alert the user about potential hiding, spoofing and graphics.
  • The commentariat should not give the impression that simplistic import implementations of ODF and OOXML, which don't handle revisions/conditionals/extensions/MCE/entities and so on, are acceptable. They are buggy and unsafe. (At the very least, simplistic import implementations should bring these oddities to the attention of the user.)
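
As a rough idea of what "bringing these oddities to the attention of the user" might look like, the following Python sketch walks a WordprocessingML-like fragment and collects warnings. The `audit` function, the `RISKY` table and the sample are assumptions made for this illustration; the element list is far from complete, and the NFKC comparison is only a coarse heuristic for look-alike characters (it also flags harmless things such as non-breaking spaces).

```python
import unicodedata
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Illustrative, incomplete table of elements worth a warning.
RISKY = {
    "{" + W + "}del": "tracked deletion",
    "{" + W + "}ins": "tracked insertion",
    "{" + W + "}customXml": "custom XML",
    "{" + W + "}drawing": "graphic (may contain rendered text)",
    "{" + MC + "}AlternateContent": "MCE alternate content",
}


def audit(xml_text):
    """Return a list of human-readable warnings for one XML fragment."""
    warnings = []
    for elem in ET.fromstring(xml_text).iter():
        label = RISKY.get(elem.tag)
        if label:
            warnings.append(label)
        # Characters that change under NFKC are compatibility variants,
        # for example full-width or mathematical letters.
        for ch in (elem.text or "") + (elem.tail or ""):
            if unicodedata.normalize("NFKC", ch) != ch:
                warnings.append("look-alike character U+%04X" % ord(ch))
    return warnings


# Hypothetical fragment: a full-width "A" (U+FF21) and a tracked deletion.
AUDIT_SAMPLE = (
    '<w:p xmlns:w="' + W + '">'
    "<w:r><w:t>P\uFF21Y</w:t></w:r>"
    "<w:del><w:r><w:delText>x</w:delText></w:r></w:del>"
    "</w:p>"
)

for warning in audit(AUDIT_SAMPLE):
    print(warning)
```

A converter could run a check like this before extracting text and refuse, or ask for confirmation, when the list is not empty.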

Now of course this unsafeness only applies to documents at risk: probably your documents are not like that and this blog is just useless info. And there is a class of documents which don't need these kinds of checks: for example an OOXML or ODF document with only CP1252 repertoire characters, standard fonts, no graphics, no stylesheets that change text from default colors, no revisions or conditionals, no customXML or XForms, no MCE or extensions, and so on. These may not need to be vetted by eye.
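
One of those criteria, the CP1252 repertoire, is easy to test mechanically. A minimal Python sketch, with `within_cp1252` being a name invented here, simply tries to encode the extracted text. It says nothing about the other criteria (fonts, colors, revisions and so on).

```python
def within_cp1252(text):
    """Return True if every character of text can be encoded as CP1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


print(within_cp1252("caf\u00e9 \u20ac5"))  # True: e-acute and the euro sign are in CP1252
print(within_cp1252("P\uFF21Y"))           # False: full-width A is not
```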
