I thought I'd go through some of the ideas in the i4i patent, following up on yesterday's post.
Here is a brief description of the patent in terms more familiar to XML-ers. But a big caveat first: patent decisions can turn on details, so a big-picture description may be enough only to get the general idea. And IANAL.
The basic idea is this.
Let us take an XML (the patent says SGML: no material difference) document. The patent uses this document:
<Chapter><Title>The Secret Life of Data</Title> <Para>Data is hostile. </Para>The End</Chapter>
The patent says that there are benefits in actually separating the text content and the markup. Something like the following.
First a data store with mapped content. Just a big string:
The Secret Life of DataData is hostile.The End
Then we take all tags and their positional effect in the mapped content: this is called a metacode map. For example, if marked up in XML it might look like this:
<map>
  <metacode position="0"><![CDATA[<Chapter>]]></metacode>
  <metacode position="0"><![CDATA[<Title>]]></metacode>
  <metacode position="23"><![CDATA[</Title>]]></metacode>
  <metacode position="23"><![CDATA[<Para>]]></metacode>
  <metacode position="39"><![CDATA[</Para>]]></metacode>
  <metacode position="46"><![CDATA[</Chapter>]]></metacode>
</map>
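To make the separation concrete, here is a minimal Python sketch of how such a split might be done. The function name `separate` and the regular expression are assumptions made for illustration, not anything taken from the patent. It treats anything of the form <...> as a tag and ignores comments, CDATA sections and entities. The sample leaves out the whitespace between elements so that the positions agree with the ones in the map.

```python
import re

# Illustrative assumption: a "tag" is any run of the form <...>.
TAG = re.compile(r"<[^<>]+>")

def separate(document):
    """Split simple markup into (mapped_content, metacode_map).

    metacode_map is a list of (position, tag) pairs, where position is
    the offset into the mapped content at which the tag takes effect.
    """
    content_parts = []
    metacode_map = []
    position = 0   # running offset into the mapped content
    last_end = 0   # end of the previous tag in the source document
    for match in TAG.finditer(document):
        text = document[last_end:match.start()]
        content_parts.append(text)
        position += len(text)
        metacode_map.append((position, match.group()))
        last_end = match.end()
    content_parts.append(document[last_end:])
    return "".join(content_parts), metacode_map

doc = ("<Chapter><Title>The Secret Life of Data</Title>"
       "<Para>Data is hostile.</Para>The End</Chapter>")
content, mapping = separate(doc)
print(content)
for position, tag in mapping:
    print(position, tag)
```

The tags stay in document order, so two tags at the same position (such as </Title> and <Para> at 23) keep their relative order.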
The patent then adds claims for the various things built on top of this structure: importing a document from SGML, editing it, marking up a raw text document (presenting the user with a menu of kinds of metacodes), combining it for export, transforming it, and so on.
The patent also generalizes things. It is not just SGML tags that could be used for metacodes; it could be any lexical convention. A single mapped content could have multiple metacode maps, each using different addresses as boundaries. The specifics of what counts as an address are fuzzy.
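The export step and the multiple-map idea can be sketched together in Python. This is only an illustration under stated assumptions: `combine` is a hypothetical helper, addresses are taken to be plain character offsets (the patent is vaguer than that), and the second map with its asterisk convention is invented for the example.

```python
def combine(content, metacode_map):
    """Re-insert metacodes into the content at their positions.

    Assumes the map is sorted by position, with document order
    preserved among entries that share a position.
    """
    pieces = []
    last = 0
    for position, metacode in metacode_map:
        pieces.append(content[last:position])
        pieces.append(metacode)
        last = position
    pieces.append(content[last:])
    return "".join(pieces)

content = "The Secret Life of DataData is hostile.The End"

# The map from the example, as (position, metacode) pairs.
structure_map = [
    (0, "<Chapter>"), (0, "<Title>"), (23, "</Title>"),
    (23, "<Para>"), (39, "</Para>"), (46, "</Chapter>"),
]

# A second, hypothetical map over the same content, using a different
# lexical convention (asterisks for emphasis) and different addresses.
emphasis_map = [(31, "*"), (38, "*")]

exported = combine(content, structure_map)
emphasized = combine(content, emphasis_map)
print(exported)
print(emphasized)
```

Both maps address the same unchanged string, which is the point: the content is stored once and each map is a separate view of it.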
Metacodes and points
Now there are two key technical issues I see with what this patent describes.
The first is that it is described in terms of metacodes. It defines metacodes like this:
A metacode, which includes but is not limited to a descriptive code, is an individual instruction which controls the interpretation of the content of the data, i.e., it differentiates the content. A metacode map is a multiplicity of metacodes and their addresses associated with mapped content. An address is the place in the content at which the metacode is to exert its effect.
A structured markup person, when seeing that, immediately has an "Aha": this is talking about points not ranges. If you know about the history of markup languages, you will know that points and ranges are one of the most significant issues.
A point-based system is one where tags turn effects on and off at certain points, such as Netscape allowing out-of-order tagging.
A range-based system is one where the tags are paired to form structures. This is what SGML and XML do. When people say "HTML is an application of SGML" this is in part code for saying "HTML is based on range markup not point markup."
So the metacodes are not, technically, elements but tags: start-tags, end-tags, PIs, Comments, etc. (The patent uses the term "markup codes" which we would call "markup tags", and so I guess the idea is that a metacode is a code abstracted out from the document.) This is very explicit in the example in the patent: it shows tags not elements and ranges.
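The difference between points and ranges can be shown with a small Python sketch. It takes a list of point-style metacodes and tries to pair them into ranges with a stack, which is roughly what a range-based system has to be able to do. The function `to_ranges` is a hypothetical helper, and it assumes only simple start-tags and end-tags with no attributes.

```python
def to_ranges(metacode_map):
    """Pair start-tags and end-tags into (name, start, end) ranges.

    Raises ValueError when the tags do not nest properly, which is
    exactly the case a pure point-based system would tolerate.
    """
    stack = []    # open elements as (name, start_position)
    ranges = []
    for position, tag in metacode_map:
        name = tag.strip("</>")
        if tag.startswith("</"):
            if not stack or stack[-1][0] != name:
                raise ValueError(f"end-tag {tag} does not close the open element")
            open_name, start = stack.pop()
            ranges.append((open_name, start, position))
        else:
            stack.append((name, position))
    if stack:
        raise ValueError("unclosed start-tags: " + ", ".join(n for n, _ in stack))
    return ranges

nested = [
    (0, "<Chapter>"), (0, "<Title>"), (23, "</Title>"),
    (23, "<Para>"), (39, "</Para>"), (46, "</Chapter>"),
]
ranges = to_ranges(nested)
print(ranges)

# Out-of-order tagging: fine as points, but not expressible as nested ranges.
out_of_order = [(0, "<b>"), (2, "<i>"), (4, "</b>"), (6, "</i>")]
try:
    to_ranges(out_of_order)
except ValueError as error:
    print("not a range structure:", error)
```

A list of (position, tag) pairs can hold either kind of markup. The range reading is an extra constraint imposed on top of the points.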
The patent says in the Summary of the Invention:
Thus, in sharp contrast to the prior art the present invention is based on the practice of separating encoding conventions from the content of a document. The invention does not use embedded metacoding to differentiate the content of the document, but rather, the metacodes of the document are separated from the content and held in distinct storage in a structure called a metacode map, whereas document content is held in a mapped content area.
That seems pretty straightforward. But then it has this puzzling sentence:
Raw content is an extreme example of mapped content wherein the latter is totally unstructured and has no embedded metacodes in the data stream.
My first reaction was "huh? so mapped content can contain tags? that doesn't fit..." but looking through the use of "stream" in the patent, it is used for data coming in from the outside world. So this seems to be merely saying that the mapped content can be made by mapping plain text as well as marked-up text.
I hope this is of some use to people interested in reading through patent 5,787,449.
Please let me know if you find any other wrinkles of significance: the use of "addressing" for example, or the precise meaning of "differentiate" perhaps. I have not dealt here with the specifics of i4i's claim against Microsoft, just what the patent is about, on the face of it. If I find out more about the specifics, I will post an entry about it.
Frankly, it is difficult to see how a patent on extracting tags and text indexes into a list relates to what Microsoft Word's customXML does: it addresses content using XPath, and XPath has a node data model, not a point data model.
However, if you imagine saving a SAX stream in an array where the text content was all allocated in a single range, that would seem to be closer to what the patent is about, as far as I can tell.
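As a rough sketch of that thought experiment, here is what it might look like with Python's standard `xml.sax` module. The class name `MapBuilder` and the tuple layout are assumptions made for illustration. Attributes are ignored, and I am not suggesting this is how any real product works.

```python
import xml.sax

class MapBuilder(xml.sax.ContentHandler):
    """Store SAX events in a list, with all text kept in one buffer."""

    def __init__(self):
        super().__init__()
        self.text = []     # pieces of character data, joined at the end
        self.length = 0    # running offset into the joined text
        self.events = []   # (offset, kind, element name)

    def startElement(self, name, attrs):
        self.events.append((self.length, "start", name))

    def endElement(self, name):
        self.events.append((self.length, "end", name))

    def characters(self, content):
        # The parser may deliver text in several chunks; offsets still add up.
        self.text.append(content)
        self.length += len(content)

def parse(document):
    handler = MapBuilder()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return "".join(handler.text), handler.events

doc = ("<Chapter><Title>The Secret Life of Data</Title>"
       "<Para>Data is hostile.</Para>The End</Chapter>")
content, events = parse(doc)
print(content)
for event in events:
    print(event)
```

The events list here plays the role of the metacode map, and the joined text plays the role of the mapped content.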
Maybe I have missed some significant aspect buried deep within the patent...