Microsoft and the two XML patents #2

More on the I4i patent for Word's custom XML

By Rick Jelliffe
August 12, 2009 | Comments: 21

I thought I'd go through some of the ideas on the I4I patent, following up from yesterday's post.

Here is a brief description of the patent in terms more familiar to XML-ers. But a big caveat first: patent decisions can turn on details, so a big-picture description may be enough to convey only the general idea. And IANAL.

Basic idea

The basic idea is this.

Let us take an XML (the patent says SGML: no material difference) document. The patent uses this document:

<Chapter><Title>The Secret Life of Data</Title>
<Para>Data is hostile. </Para>The End</Chapter>

The patent says that there are benefits in actually separating the text content and the markup. Something like the following.

First a data store with mapped content. Just a big string:

The Secret Life of DataData is hostile.The End

Then we take all tags and their positional effect in the mapped content: this is called a metacode map. For example, if marked up in XML it might look like this:

       <metacode position="0"><![CDATA[<Chapter>]]></metacode>
       <metacode position="0"><![CDATA[<Title>]]></metacode>
       <metacode position="23"><![CDATA[</Title>]]></metacode>
       <metacode position="23"><![CDATA[<Para>]]></metacode>
       <metacode position="39"><![CDATA[</Para>]]></metacode>
       <metacode position="46"><![CDATA[</Chapter>]]></metacode>
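
To make the split concrete, here is a minimal Python sketch of my own (not the patent's algorithm) that pulls the tags out of the example and records each one against the length of the text seen so far. The names split_markup, TAG and metacode_map are hypothetical, and I have dropped the line break and the trailing space from the sample so that the offsets match the map above.

```python
import re

# Hypothetical tag pattern: anything between angle brackets counts as a metacode.
TAG = re.compile(r"<[^>]+>")

def split_markup(doc):
    """Return (mapped_content, metacode_map) for a tagged string."""
    pieces = []      # runs of text between tags
    metacodes = []   # (position in mapped content, tag text)
    position = 0     # length of the mapped content so far
    last = 0         # index in doc just after the previous tag
    for match in TAG.finditer(doc):
        text = doc[last:match.start()]
        pieces.append(text)
        position += len(text)
        metacodes.append((position, match.group()))
        last = match.end()
    pieces.append(doc[last:])  # any text after the final tag
    return "".join(pieces), metacodes

doc = ("<Chapter><Title>The Secret Life of Data</Title>"
       "<Para>Data is hostile.</Para>The End</Chapter>")
content, metacode_map = split_markup(doc)
print(content)
for position, tag in metacode_map:
    print(position, tag)
```

The regular expression is only good enough for this toy input; a real SGML or XML parser would be needed for comments, CDATA sections, attributes containing ">", and so on.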

The patent then adds claims for the various things built on top of this structure: importing a document from SGML, editing it, marking up a raw text document (presenting the user with a menu of kinds of metacodes), combining it for export, transforming it, and so on.
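
The "combining it for export" step is just the reverse walk. A minimal sketch, assuming the metacode map is a list of (position, tag) pairs already in document order; the function name merge and the data layout are my own illustration, not something the patent specifies.

```python
def merge(content, metacode_map):
    """Re-insert each tag at its recorded position in the mapped content."""
    out = []
    last = 0
    for position, tag in metacode_map:  # assumed sorted in document order
        out.append(content[last:position])  # text since the previous tag
        out.append(tag)
        last = position
    out.append(content[last:])  # any text after the final tag
    return "".join(out)

content = "The Secret Life of DataData is hostile.The End"
metacode_map = [
    (0, "<Chapter>"), (0, "<Title>"), (23, "</Title>"),
    (23, "<Para>"), (39, "</Para>"), (46, "</Chapter>"),
]
exported = merge(content, metacode_map)
print(exported)
```

Note that tags sharing a position (the two at 0, the two at 23) only come out right because the list order is preserved; the position alone does not say which comes first.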

The patent also generalizes things. It is not just SGML tags that could be used for metacode, it could be any lexical convention. A single mapped content could have multiple metacode maps each using different addresses as boundaries. The specifics of what is an address is fuzzy.

Metacodes and points

Now there are two key technical issues I see with what this patent describes.

The first is that it is described in terms of metacodes. It defines metacodes like this:

A metacode, which includes but is not limited to a descriptive code, is an individual instruction which controls the interpretation of the content of the data, i.e., it differentiates the content. A metacode map is a multiplicity of metacodes and their addresses associated with mapped content. An address is the place in the content at which the metacode is to exert its effect.

A structured markup person, when seeing that, immediately has an "Aha": this is talking about points not ranges. If you know about the history of markup languages, you will know that points and ranges are one of the most significant issues.

A point-based system is one where tags turn effects on and off at certain points. An example is Netscape allowing out-of-order tagging such as <b>hello<i> </b>world</i>.

A range-based system is one where the tags are paired to form properly nested structures. This is what SGML and XML do. When people say "HTML is not an application of SGML" this is in part code for saying "HTML is based on point markup not range markup."
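
One way to see the difference: a range-based reading requires every end-tag to close the most recently opened element, while a point-based reading does not care. Here is a small Python sketch of that check; the function name and the simplified tag pattern (no attributes, no empty elements) are my own assumptions for illustration.

```python
import re

# Simplified tags only: <name> or </name>, no attributes.
TAG = re.compile(r"<(/?)(\w+)>")

def is_properly_nested(markup):
    """True if every end-tag closes the most recently opened element."""
    stack = []
    for closing, name in TAG.findall(markup):
        if not closing:
            stack.append(name)  # start-tag opens a range
        elif not stack or stack.pop() != name:
            return False        # end-tag does not match the open range
    return not stack            # nothing may be left open

overlapping = "<b>hello<i> </b>world</i>"    # fine as points, not as ranges
nested = "<b>hello<i> </i></b><i>world</i>"  # same effect, as ranges

print(is_properly_nested(overlapping))
print(is_properly_nested(nested))
```

A flat list of tags and positions can hold either string equally well, which is why it reads to me as a point model.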

So the metacodes are not, technically, elements but tags: start-tags, end-tags, PIs, Comments, etc. (The patent uses the term "markup codes" which we would call "markup tags", and so I guess the idea is that a metacode is a code abstracted out from the document.) This is very explicit in the example in the patent: it shows tags not elements and ranges.


The patent says in the Summary of the Invention:

Thus, in sharp contrast to the prior art the present invention is based on the practice of separating encoding conventions from the content of a document. The invention does not use embedded metacoding to differentiate the content of the document, but rather, the metacodes of the document are separated from the content and held in distinct storage in a structure called a metacode map, whereas document content is held in a mapped content area.

That seems pretty straight-forward. But then it has this puzzling sentence:

Raw content is an extreme example of mapped content wherein the latter is totally unstructured and has no embedded metacodes in the data stream.

My first reaction was "huh? so mapped content can contain tags? that doesn't fit..." but looking through the use of "stream" in the patent, it is used for data coming in from the outside world. So this seems to be merely saying that the mapped content can be made by mapping plain text as well as marked-up text.

I hope this is of some use to people interested in reading through patent 5,787,449.

Please let me know if you find any other wrinkles of significance: the use of "addressing" for example, or the precise meaning of "differentiate" perhaps. I have not dealt in this with the specifics of i4i's claim against Microsoft, just what the patent is about, on the face of it. If I find out more about the specifics, I will post an entry about it.

Frankly, it is difficult to see how a patent on extracting tags and text indexes into a list relates to what Microsoft Word's customXML does: it addresses using XPath and XPath has a node data model not a point data model.

However, if you imagine saving a SAX stream in an array where the text content was all allocated in a single range, that would seem to be closer to what the patent is about, as far as I can tell.
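
Roughly what I have in mind, as a sketch using Python's standard xml.sax module: the handler class name and the (position, kind, name) event layout are my own invention, and the sample document has its line break and trailing space removed so the offsets match the patent's numbers.

```python
import xml.sax

class MetacodeHandler(xml.sax.ContentHandler):
    """Save SAX events in a list, with all text gathered into one string."""

    def __init__(self):
        super().__init__()
        self.text = []    # chunks of character data, joined at the end
        self.length = 0   # length of the text gathered so far
        self.events = []  # (position, kind, element name)

    def startElement(self, name, attrs):
        self.events.append((self.length, "start", name))

    def endElement(self, name):
        self.events.append((self.length, "end", name))

    def characters(self, data):
        self.text.append(data)
        self.length += len(data)

doc = (b"<Chapter><Title>The Secret Life of Data</Title>"
       b"<Para>Data is hostile.</Para>The End</Chapter>")
handler = MetacodeHandler()
xml.sax.parseString(doc, handler)
content = "".join(handler.text)
print(content)
print(handler.events)
```

Attributes, comments and processing instructions are ignored here; they would need their own entries in the event list.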

Maybe I have missed some significant aspect buried deep within the patent...

thanks for the write up.
This is an amazing lawsuit. This begs the question of who is filing and approving patents, whether they are qualified to do so, and who is deciding whether or not there is an infringement. Does a judge in Texas know anything about technology? I'm going to patent a hashtable and sue both i4i and MS.
there's no way this will hold up to an appeal.

Reading the judgment, it seems that the judge took at every step the most ludicrous maximalist line. A metacode could be almost anything, a data structure could be almost anything, the content could be almost anything, there was no need to look at source code.

By the end of the judgment I was left thinking "what interactive XML system with any links wouldn't be included in this?" which is utterly ridiculous.

I was creating SGML systems from 1989, and the i4i patent is just as obvious then as it is now.

Just think of Emacs markers. Here is the 1996 entry from the Wayback machine: it would go further back than that.

This is a great example of why it can be impossible to do prior art search for software. There were numerous word processing systems in the 70s and 80s that took the approach described as an invention in this patent application. All of the companies that created these systems are gone and none of the systems is running. There was no Web so the only articles are in old magazines. Companies didn't publish the details of their implementations because they considered their approaches trade secrets and it also prevented competitors from easily creating converters.

"XML is clearly in the public domain," said i4i's Chairman Loudon Owen. "What we have developed at i4i is what's customarily referred to as 'customer-centric' or 'custom XML,' which is allowing people to create customer-driven schema -- we'll call it templates or forms. So, while XML is used to tag and to mark the data that's created, our technology is used to create the whole schema and the management of the data."

The invention goes beyond XML, according to Owen.

"XML in and of itself -- just like the letters in the alphabet -- is not terribly useful," he said. "This implementation leverages XML."

The basic difference between your posts and reality is that in reality, the custom XML refers to the XML tags and not to the document or the data. Custom XML means that you can use XML to create your own structure, even if you're a customer or anyone. For example, only I4i can legally create XML documents for interchanging patent documentation between patent offices. That's because if you don't have the data in the XML tree list, then you have infringed upon the patent!

Anonymous: I understand neither what you are saying nor what Loudon Owen is saying. It is gobbledygook.

Data location links (and many other kinds of links) were already ISO standardized by 1992 in IS10744. At that time all SGML documents had to have a DTD, so both source and target documents had to have schemas. Data location links allowed linking to arbitrary text, and HyTime was extensible.

I am waiting for an explanation of the patent in concrete industry-standard terms that makes any sense. The lack of any such explanation is odd, and smacks of fudging.

My interpretation of the patent was fairly similar to yours, Rick, though I do find the language used in patents very difficult to relate to ordinary software engineering terminology. (Presumably that's part of the mystique: it's all designed to make lawyers indispensable.)

The other thing I find very difficult when reading patents is to know how close two systems have to be to be considered equivalent/infringing. If the patent is interpreted very broadly then it describes vast numbers of systems developed both before and after the filing. If it is interpreted narrowly, then it's hard to see how MS Office is infringing.

In fact, this is my main intellectual problem with understanding how on earth software patents are supposed to operate: it depends on some set of rules for deciding whether two rather abstract ideas are equivalent or isomorphic, and I've never seen any computer science theory that would underpin such a notion of isomorphism.

But I don't think computer scientists get consulted. It's lawyers making it up as they go along.

Mike: I was watching that Tony Robinson show on British legal history last night, about how the regional laws were gathered into a Common Law and published under Henry II.

It seems the Americans are intent on having a pre-Henrican legal system: not only do they disregard habeas corpus (made under Henry's son) but also by allowing this kind of jurisdiction shopping they are avoiding having a common law. Of course, I am just sniping and glib. On behalf of the rest of the world, I wish the US would get its house in order on this kind of thing. You broke it, you fix it.

Reading the judgment, it seemed the deciding factor was that the judge was pissed off at the MS lawyers.

As for a rationale for the system, the whole thing is a Ponzi scheme where the USPTO can assume that shonky patents can be challenged and so doesn't examine thoroughly enough (Reagan-era cost cutting, etc.), while the courts give the benefit of the doubt to patents.

I know this is off topic from concerns about the scope of the claims in the I4i patent, but I have a different nit.

I don't see how your example works, with regard to the position attribute values. The values don't seem to be correct.

I think the positions should be 0, 0, 23, 23, 46, 46 respectively, if I understand how start and end positions might work.

Orcmid: Yes. The patent actually uses the numbers 0,0,23,23,39,46 in its method, which is different from yours or mine. (I think you made a typo.)

It depends on which 'metacode' you associate the text with. The description of the algorithm of the patent is unworkable as it is: it relies on associating a single string with a single metacode. That cannot handle mixed content well, since it would associate text with the preceding tag: that is completely a point-based view, not XML's element kind of view.

But the patent is not limited to a particular method, in its "broadest aspect".

I used a different algorithm, so that the last text was associated with the </Chapter>: perhaps point markup is too unsettling to my subconscious :-) But it is confusing, so I've fixed my numbers to match the patent. Thanks.

Given what I've read here and in the patent itself, I believe I understand how Microsoft is infringing.

Basically, i4i is describing a generic system which, instead of embedding XML into the content to give it structure, they put the markup in a separate area, with pointers to specific places in the content. Simple enough.

Naturally, nobody building an XML parser would do this, *unless* they wanted their parser to handle arbitrary markup that it does not know about *and* they don't like DTDs and normal notions of extensibility and such. This is Microsoft all over.

Basically, any time you encounter markup that you don't understand, you pull it out and store it with a pointer. Later, you put the markup back into place. This ensures that the custom markup survives without interfering with what you're doing yourself. A bass-ackwards implementation to be sure, but given a certain frame of mind, I can see it.

Otto: Err, why is anyone supposed to be limited to "normal ideas" about anything technical? And what are the normal ideas now were not the normal ideas 15 years ago: hence my point about HyTime in the companion blog entry.

You are right that this is not to do with parsing. It relates to editing or guided transformations/manipulations.

I don't get your point in your last paragraph: can you elucidate?

That's actually pretty clever-- but I'm having a hard time seeing what practical advantage it is!

It's nice that the markup is separated from the text because you could put the text at the start of the file and the markup at the end. Then, it would be faster to read or search just the text-- especially if there is a lot of markup.

But honestly this is better filed under "Stupid Programmer Tricks" than $200 million patents.

Personally, I think we would be better off without software patents.

Glen Pepicelli

Glen: The advantage is that you can have the data stored "pure" (no formatting markup) and use industry standard schemas for that information.

In other words, it allows Word to act as a kind of forms editor, where the data is available as a single concise ZIP file in whatever schema is suitable. And it does this with no programming.

(See XForms for the W3C technology in a similar area.)

Isn't the DOC format for Word 97-2000 set up this very way? That is, with the text separated from formatting, and pointers back into the text block for style/font/structure elements? That was what I took away from reading the patent and comparing it with the published data on the DOC format, and opening a DOC file in a hex editor. It sure looked like what the patent suggests.

Given that, and given the date that Microsoft released Word 97 versus when i4i marketed s4 (both around 1997), does the patent have any standing? Who did it first?

John: Binary formats are not impacted by the patent, as the appeal decision notes. It is markup based systems (XML, SGML).

The DOC format already allows Word to store the locations of style and font occurrences elsewhere in the file and point those back into the text stream. It's not a great leap to translate those to HTML or CSS markup. Or for that matter to set similar pointers back to the text stream for other occurrences, like SGML or XML tagging, and house them inside a binary format.

I know one denotes formatting and the other denotes structure, but where you're talking about elements represented by external pointers back into a text stream, they're really all the same.

I guess what I am getting at is that the patent seems indefensible from that perspective.

John: Yes, the only way I see it could make any sense is if we have the wrong end of the stick and somehow the patent means something completely different to what it seems to...

I hate software patents, and despite trying, any intelligent comment I try to put in ends up in a tirade. I think this case had the potential to really make a difference and engage people in what is a massive debate (software patenting is the most important area of corporate law at the beginning of this century), but it all just went pear-shaped.

When I look at my screen, right now, I see just a string of text in this window, no embedded tags. Hence, I know that we have been using external lists of tags with character pointers to strings of text, otherwise this text box would not work on the web.

It is simply not clear or obvious what is new, except the use of mark up languages as part of a character pointer format.

The key insight here was that they created a markup system based on positional values within a text file (as some have mentioned...pointers to the actual character position of the start and end tags), as opposed to a tag based approach.

As someone else mentioned, this is a nifty programmer hack. You can keep the document clean of markup, if you just note the position of each formatting element in the document, and put that data in a separate doc. However, it is absurd to think that this is something worthy of patenting.

For the layman, it's the equivalent of saying that in order to map the location of a store within a city, I am going to number all of the stores sequentially, starting with a pre-defined store, and then assign addresses to those stores in a separate document, based on whatever number I gave it.

The takeaway here is the entire patent office needs to have any technology based business taken out of its purview. It cannot be trusted to issue patents in this area. The government either needs to create a separate patent division, or outsource this to someone who knows what they're talking about. No amount of fancy lawyer speak is going to change the fact that this company is wasting people's time and money over a patent that never should have been issued in the first place. This is a drain on society and its resources. i4i should be ashamed of itself, and its customers should ban its products. I'm no MS lover, but this type of abuse of the system is completely ridiculous.
