Do we need lazy loading XML parsers to make XHTML scalable?

By Rick Jelliffe
September 10, 2009 | Comments: 4

(Hat tip: elharo) The W3C Systeam's blog has a hilarious item, "W3C's Excessive DTD Traffic".

Apparently, generic XML systems are trying to download the DTD using the DOCTYPE declaration's system identifier (i.e. what it is for) on XHTML files, or downloading the schemas from the namespace URI (i.e. not what it is for) for documents with XHTML fragments. And it adds up to a lot of bogus traffic.

Maybe using hrefs for system identifiers is not such a good idea? Unless standalone="yes" is set in the XML declaration, the system identifier is indeed a link that is supposed to be traversed, in the absence of local interception by a catalog system.
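For illustration, here is a minimal sketch of that kind of local interception in Python with lxml; the cache directory path is a hypothetical stand-in for wherever a system keeps its local copies of the W3C DTDs and entity sets:

    from lxml import etree

    class LocalDTDResolver(etree.Resolver):
        # Serve the W3C XHTML DTDs and entity sets from a local cache
        # instead of fetching them from www.w3.org on every parse.
        def resolve(self, system_url, public_id, context):
            prefix = "http://www.w3.org/TR/xhtml1/DTD/"
            if system_url and system_url.startswith(prefix):
                # Hypothetical local cache location
                local_path = "/usr/share/xml/xhtml1/" + system_url[len(prefix):]
                return self.resolve_filename(local_path, context)
            return None  # anything else resolves normally

    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    parser.resolvers.add(LocalDTDResolver())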

Here is where the problem is in XHTML 1.0:

4. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in DTDs [Appendix A of the specification] using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.

5. The DTD subset must not be used to override any parameter entities in the DTD.

XHTML 1.1 removes requirement 5.

But really, the minimum that is needed for XML conformance, it seems to me, is something like this:

<!DOCTYPE html [
  <!ENTITY % HTMLsymbols
          SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
  %HTMLsymbols;
]>

or even just

<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">

The primary reason that an XHTML file would reference the DTD is to get access to the special publishing characters, such as &mdash;.

So apart from the namespace retrieval issue, I think it would be good citizenship for non-validating XML parsers to only retrieve external parameter entities (and the external subset referenced from the DOCTYPE declaration is the primary one of those) if the document actually uses any entity references, and to stop once the entity set has been found.
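A sketch of what such a lazy policy could look like, again in Python with lxml (the pre-scan heuristic is my own illustration, not a feature of any existing parser): only turn on DTD loading when the document actually contains entity references beyond XML's five built-in ones.

    import re
    from lxml import etree

    PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

    def uses_entity_references(xml_bytes):
        # Crude pre-scan for entity references beyond the five built into
        # XML itself. Numeric character references such as &#8212; never
        # need the DTD, and the pattern does not match them.
        names = re.findall(rb"&([A-Za-z][A-Za-z0-9._-]*);", xml_bytes)
        return any(n.decode("ascii") not in PREDEFINED for n in names)

    def parse_lazily(xml_bytes):
        if uses_entity_references(xml_bytes):
            # Entities in use: load the external subset to resolve them.
            parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
        else:
            # No entity references: skip the DTD and stay off the network.
            parser = etree.XMLParser(load_dtd=False, no_network=True)
        return etree.fromstring(xml_bytes, parser)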

So a DOCTYPE declaration like the above, coupled with a lazy entity loading policy for non-validating parsers, would reduce the traffic at the W3C to only the special character public entity sets, and only where these are actually needed.

If it is still too high, then XHTML is not scalable, I suppose.

Another option, and one that I have long called for, is merely that *all* the ISO/MathML special character public entity sets should be built into XML. The only people who might complain, it seems to me, are the people trying to fit an XML parser into FPGAs or other smaller devices. XML needs to be closer to HTML in this regard, to allow XHTML etc. to become good citizens on the web.
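For comparison, this is effectively how HTML already works: its named character references are part of the language, not something to be fetched. Python's standard library, for one, carries the full HTML5 table:

    import html
    from html.entities import html5

    print(html5["mdash;"])             # '\u2014', the em dash
    print(html.unescape("A&mdash;B"))  # 'A\u2014B', no DTD fetched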


4 Comments

I think the solution is simple: just don’t allow HTML entities in XHTML. AFAIK this is already the case in browsers that support XHTML now. And frankly, good riddance — go Unicode! (← look, it’s an em-dash).

As for the rest, XHTML is as scalable as HTML, I’d say. DTDs aren’t. The same trick HTML parsers have to use to avoid obliterating w3.org with requests, an internal catalog of HTML doctypes, is also commonly used in XML-land to reduce DTD traffic.

In the end, it is best to just avoid DTDs altogether: they have known issues like these, they don’t know how to deal with XML Namespaces, and they do very little that other technologies can’t do better.

Laurens: I think the real problem here is keyboards. We still have the little ones that don't give us access to the basic publishing characters of Unicode. We are stuck pretty much in 1975.

I want a keyboard with em-dash and so on. A writer's keyboard. There are all these useless function keys at the top. We are all using at least ISO 8859-n systems now in the West, even on Linux.

The option of direct characters is even better than references (named entity references or numeric character references). [Now someone will comment: but you can map a function key using XYZ. To which I will say: but I want a nice label. To which someone will say: I know a company that makes keys, or use Dymo or Letraset or IdMark or whatever.]

>Laurens: I think the real problem here is keyboards.

It depends on the system. Mac OS X has a wonderful system for accessing *all* Unicode characters in a very friendly way. See http://farm3.static.flickr.com/2336/2174814865_d3b298e77c.jpg

Well, the US-International input method on Windows is pretty reasonable; it can type ‘ and ’ easily. I made a custom extension to it so that I can type characters such as “, ”, — and … as well (http://www.grauw.nl/blog/entry/430).

But yeah, if you don’t use that, you have to pull out the character map. Something more sophisticated and easier to use would be good.
