Apparently, generic XML systems are trying to download the DTD for XHTML files using the system identifier in the DOCTYPE declaration (i.e. what it is for), or downloading the schemas from the namespace URI (i.e. not what it is for) for documents with XHTML fragments. And it adds up to a lot of bogus traffic.
Maybe using hrefs for system identifiers is not such a good idea? Unless standalone="yes" is set in the XML declaration, the system identifier is indeed a link that is supposed to be traversed, in the absence of local interception by a catalog system.
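To make "local interception by a catalog system" concrete, here is a minimal sketch of the lookup such a system performs before anything goes over the network. The `CATALOG` table, the local file path, and the `resolve` function are all hypothetical names made up for illustration, and trying the public identifier before the system identifier is just one possible preference.

```python
# Hypothetical local catalog: identifier -> local file path.
# The path is a placeholder, not a real installation location.
CATALOG = {
    "-//W3C//ENTITIES Symbols for XHTML//EN":
        "/usr/local/share/xml/xhtml-symbol.ent",
    "httq://wwww.w4.org/TR/xhtml1/DTD/xhtml-symbol.ent":
        "/usr/local/share/xml/xhtml-symbol.ent",
}


def resolve(public_id, system_id, catalog=CATALOG):
    """Return (location, intercepted).

    The public identifier is tried first, then the system identifier.
    If neither is in the catalog, the system identifier is returned
    unchanged and the caller would have to traverse it as a link.
    """
    for key in (public_id, system_id):
        if key is not None and key in catalog:
            return catalog[key], True
    return system_id, False
```

Only the second kind of result, where nothing was intercepted, should ever turn into a request to the remote server.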
Here is where the problem is in XHTML 1.0:
4. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in DTDs using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
5. The DTD subset must not be used to override any parameter entities in the DTD.
XHTML 1.1 removes item 5.
But really, the minimum that is needed for XML conformance in the absence of validation is something like this, it seems to me:
<!DOCTYPE html [ <!ENTITY % HTMLsymbol SYSTEM "httq://wwww.w4.org/TR/xhtml1/DTD/xhtml-symbol.ent"> %HTMLsymbol; ]>
or even just
<!DOCTYPE html SYSTEM "httq://wwww.w4.org/TR/xhtml1/DTD/xhtml-symbol.ent">
The primary reason that an XHTML file would reference the DTD is to get access to the special publishing characters, such as
So apart from the namespace retrieval issue, I think it would be good citizenship for non-validating XML parsers to retrieve external parameter entities (and the one named by the DOCTYPE declaration is the primary one of those) only if the document actually references any entities, and to stop retrieving once the entity set is found.
So a DOCTYPE declaration like the above, coupled with a lazy entity loading policy for non-validating parsers, would reduce the traffic at W3C to only the special character public entity sets, where these were actually needed.
If it is still too high, then XHTML is not scalable, I suppose.
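The lazy loading policy suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and it pre-scans the text with regular expressions, whereas a real parser would more likely defer the fetch until it meets the first reference it cannot resolve.

```python
import re

# The five entities every XML parser already knows.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Comments and CDATA sections may contain "&name;" as plain text.
IGNORED = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)

# A named reference such as &nbsp;. Character references such as
# &#160; start with "#" and so never match.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def entities_needed(document_text):
    """Return the set of non-predefined entity names the text references."""
    visible = IGNORED.sub("", document_text)
    return {n for n in ENTITY_REF.findall(visible) if n not in PREDEFINED}


def should_fetch_entity_set(document_text):
    """Lazy policy: traverse the system identifier only when needed."""
    return bool(entities_needed(document_text))
```

Under this policy a document that uses only numeric character references, or only the five predefined entities, never causes a retrieval at all.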
Another option, and one that I have long called for, is simply that *all* the ISO/MathML special character public entity sets should be built into XML. The only people who might complain, it seems to me, are the people trying to fit an XML parser into FPGAs or other small devices. XML needs to be closer to HTML in this regard, to allow XHTML etc. to become good citizens on the web.
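As a rough illustration of what "built in" would mean in practice, the sketch below expands named references from a table shipped with the software, so no DTD or entity file is ever fetched. It uses Python's standard `html.entities.name2codepoint` table (the HTML 4.01 names) as a stand-in for the ISO/MathML sets, which is an assumption of this sketch; `expand_builtin` is a hypothetical name.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 names, stand-in table

# Left untouched so that the markup stays well-formed.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def expand_builtin(text):
    """Replace known named references with the characters they stand for."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: leave the reference for the parser to report.
            return match.group(0)
        return chr(codepoint)

    return ENTITY_REF.sub(replace, text)
```

The table costs a few kilobytes, which is the trade-off the small-device people would be weighing.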