Why I think XML 1.0 (fifth edition) is wrong-headed

By Rick Jelliffe
December 17, 2008 | Comments: 5

I would like to join Elliotte Rusty Harold, James Clark, Tim Bray, Michael Kay and David Carlisle in deprecating or being disappointed by XML 1.0 (fifth edition).

The fifth edition loosens up rules about characters that can appear in names. This means that anyone who actually creates such documents will find they are not accepted by approx 100% of XML parsers out in the world, as of now. Guaranteed non-interoperability in the name of better inclusiveness.

The old rule was this:


[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*

which referenced the Unicode properties, giving

[84]   	Letter	   ::=   	 BaseChar | Ideographic
[85]   	BaseChar	   ::=   	[#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | 
#x0E30 | [#x0E32-#x0E33] | [#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] | [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] | #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]
[86]   	Ideographic	   ::=   	[#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
[87]   	CombiningChar	   ::=   	[#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099 | #x309A
[88]   	Digit	   ::=   	[#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
[89]   	Extender	   ::=   	#x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE] 

The new rule is this:


[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
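
To make the new productions concrete, here is a minimal Python sketch that transcribes [4], [4a] and [5] into a regular expression. The names NAME_START, NAME_EXTRA and is_name_5e are invented for this illustration and do not come from any XML library; the sketch tests the Name production only, not well-formedness as a whole.

```python
import re

# NameStartChar, production [4] of the fifth edition, written as the
# contents of a regex character class.
NAME_START = (
    "A-Za-z_:"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# The extra characters that NameChar adds in production [4a].
NAME_EXTRA = r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Production [5]: one NameStartChar followed by any number of NameChar.
NAME_5E = re.compile("[" + NAME_START + "][" + NAME_START + NAME_EXTRA + "]*")


def is_name_5e(candidate: str) -> bool:
    """Return True if the whole string matches the fifth-edition Name rule."""
    return NAME_5E.fullmatch(candidate) is not None
```

Under this sketch an Ethiopic name such as "\u1230\u120b\u121d" is accepted, because the Ethiopic block falls inside [#x37F-#x1FFF], while the same name has no matching range in the old BaseChar list. A name starting with a digit, such as "1abc", is still rejected.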

So what is wrong about it?

Goal 1

Now that XML is so widely deployed, Goal 1 of XML, "XML shall be straightforwardly usable over the Internet", surely must mean that a document marked XML 1.0 can be expected to work with XML 1.0 processors. Not so, according to the XML Core WG: forget the idea of clear labels.

Forget any "guarantee" of interoperability. It never existed.

I think people will have a better understanding now of why the W3C does not refer to its specifications as 'standards' but as 'recommendations'. W3C doesn't have the stability or the discipline or the processes to prevent this kind of rubbish, which makes it a great organization for midwifing technologies and a terrible one for maintaining important ones.

Goal 2

There is a sharp divide between how database people think about identifiers and how programmers, especially text programmers, think about identifiers. Database people think that an identifier is a key and that any restriction on identifiers is unnecessary: they are just string fields. Programmers see identifiers as needing to be lexically distinct from other tokens in a language, to allow parsing. The XML restriction on element, attribute and entity identifiers starting with digits comes out of this.

This in turn maximizes the chances that an XML identifier can, for example, be used directly as a name in a dynamic programming language. This was a technique available in JavaScript, for example, and I see examples of it from time to time: your document is <x id="fred"/> and the DOM object is available, for example fred.font.

Goal 2 of XML, "XML shall support a wide variety of applications", indicates to me, at least, that XML should attempt to support this kind of use. The way to do this is to adopt Unicode's guidelines in these areas, which were, in turn, a response to XML's rules. That way everyone converges on a common set, with the maximal chance that the tokens from one will be acceptable to the other.

There will always be a mismatch in the ASCII range, but Goal 6 is one of conservatism and even deference.
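
As a rough illustration of the convergence I mean: Python's identifier rules follow the Unicode identifier guidelines (UAX #31, the XID_Start and XID_Continue properties, plus the underscore), so str.isidentifier() can stand in for "what a programming language that follows Unicode will accept". That stand-in is my assumption for this sketch, and the helper name usable_as_identifier is made up.

```python
def usable_as_identifier(xml_name: str) -> bool:
    """Return True if an XML name could be used directly as a Python identifier.

    str.isidentifier() applies Python's lexical rules for identifiers,
    which are based on Unicode's XID_Start / XID_Continue properties.
    """
    return xml_name.isidentifier()


# A few names to try; all of them are acceptable XML names under the
# fifth-edition rules, but only some are usable as identifiers.
candidates = ["fred", "my-element", "xsl:template", "\u1230\u120b\u121d", "\u200c"]
for name in candidates:
    print(repr(name), usable_as_identifier(name))
```

The hyphen and colon cases are the ASCII mismatch mentioned above. The zero-width non-joiner (U+200C) is the more worrying case: the fifth edition accepts it as a complete name, while the Unicode identifier rules do not accept it as an identifier start.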

Goal 6

XML Goal 6: "XML documents should be human-legible and reasonably clear."

Unless I have made a mistake, it is now possible to have a document with this as a start-tag and it all be well-formed:

Or you can have a start-tag with "characters" that are not even allocated to Unicode code points. The old rules didn't allow non-graphical characters, but now we have characters that are so non-graphical they don't even exist. You can have a well-formed XML document that uses element names that are invisible in any font in the world. You could have a start-tag that displays as <>!
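
This can be checked against a Unicode character database. The following is only a sketch: it uses the database bundled with whatever Python interpreter runs it (so the count varies with the Unicode version), and the names NAME_START_RANGES, is_name_start and count_unassigned_name_starts are invented for the example.

```python
import unicodedata

# NameStartChar ranges from production [4] of the fifth edition.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_start(code_point: int) -> bool:
    """True if the code point is allowed to start a fifth-edition name."""
    return any(lo <= code_point <= hi for lo, hi in NAME_START_RANGES)


def count_unassigned_name_starts() -> int:
    """Count NameStartChar code points that the local Unicode database
    reports as unassigned (general category Cn)."""
    return sum(
        1
        for lo, hi in NAME_START_RANGES
        for code_point in range(lo, hi + 1)
        if unicodedata.category(chr(code_point)) == "Cn"
    )


if __name__ == "__main__":
    # U+200C ZERO WIDTH NON-JOINER: a format character with no visible glyph.
    print(is_name_start(0x200C), unicodedata.category("\u200c"))
    print(count_unassigned_name_starts())
```

U+200C is a format character (category Cf) with no glyph of its own, yet it is a legal one-character element name under the new production. That is the <> case.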

Of course, you could at the moment have an element name like _-.-.-.-._ but that is because those characters are needed for isolating languages like English.

Goal 4: who does it actually help?

These changes are nominally to support Ethiopians and so on who want the advantages of Native Language Markup, but who don't want XML 1.1. That is laudable, but they actually will be worse off under the fifth edition, for the reason I gave above: the document does not label itself with enough information to know whether an XML processor can handle it or not.

It is jaw-droppingly muddle-headed.

This is the same kind of thinking that gave us the character set hell of the pre-XML internet: that creating problems there doesn't matter if you are solving a problem here. No amount of feeling wrong justifies it.

So if it will make life worse for the Ethiopians, etc, who does it help? For years there has been a push, mainly from the database vendors, to simplify the rules. They want simpler checking, and indeed this may fit under Goal 4, "It shall be easy to write programs which process XML documents." Database vendors don't want anything to slow up their systems: traditionally they left issues of figuring out whether some data was in the correct encoding to application programmers (who, having no training or experience, then fail to do anything). A highly respected writer who worked for one of them once told me that their bottom line was increasing their benchmarks, and data integrity was not an issue.

Now it could be argued, I suppose, that the simpler rules would encourage more checking; however, a coarse, block-level filter has always been available. These rules don't stop that.

For implementation, there are techniques available that reduce the checking (i.e. you only need to check that identifiers in end-tags match the corresponding identifier in the start tag, you don't need to check each character against the naming rules) and indeed the cost of checking name characters < U+00FF (the most common case) seems identical under the new rules to the old. So the new rules may result in simpler code, but not faster for the majority of documents.
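
A minimal sketch of that end-tag technique, assuming the input has already been tokenized into ("start", name) and ("end", name) pairs. The event format and the names tags_balanced and is_name are hypothetical; no real parser API is implied.

```python
from typing import Callable, Iterable, Tuple


def tags_balanced(
    events: Iterable[Tuple[str, str]],
    is_name: Callable[[str], bool],
) -> bool:
    """Check names and nesting for a stream of start/end tag events.

    The per-character naming rules (is_name) run once, on each start-tag.
    An end-tag is only compared, as a whole string, with the start-tag it
    has to close, so its characters are never checked against the rules.
    """
    open_names = []
    for kind, name in events:
        if kind == "start":
            if not is_name(name):
                return False
            open_names.append(name)
        else:
            # A name equal to an already validated start-tag name is valid.
            if not open_names or open_names.pop() != name:
                return False
    return not open_names
```

Whether is_name implements the old tables or the new ranges makes no difference to this part of the cost. The saving comes from the comparison, not from the simpler rules.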

So what do-gooder itch is being scratched here, which thinks that substituting an unreliable broken system will suit the needs of marginal-script users more than a working, workable one with clear limits? Have the unpragmatic idealists been suckered by big business who exclude interests other than their own? Are inveterate tinkerers in the ascendancy at W3C? I suspect not—no itch, no suckers, no dilettantism. But I don't see that this is actually addressing any internationalization requirement at all: quite the reverse. Internationalization is a face-saving cover story, an alibi.

A better way?

As I have written before, this is a failure of the XML versioning system. XML 1.0 edition 5 says, in effect: that version number that looked like it would provide clarity and protection from incompatible versions? Oh, that is all just a joke; it doesn't do anything. I don't know how many times I have said this, but the solution to this would have been, one year ago, or two years, or five, or seven, or even now, to clearly add a minor/major version numbering system so that the incoming generation of parsers support future compatible or incompatible changes. The current versioning system doesn't work, to the extent that it is not abandoned by this fifth edition.

But this is a long running problem, and I think the better approach would be to define a coarse filter, based on Unicode blocks, which an XML processor could use in lieu of the full WF-checks. For example, a 2^8 table of booleans to check the ASCII/Latin1 block and a 2^9 table to disallow various punctuation blocks. Tightly-coupled applications which could not, for whatever reason, enforce the full WF rules could use these.
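
Here is one way such a coarse filter might look, as a sketch only. I am assuming that the 2^9 table is indexed by 128-code-point chunks of the Basic Multilingual Plane (65536 / 128 = 512), and the particular denied ranges are illustrative choices, not a proposal. All names are invented for the example.

```python
# 2^8 table: exact NameChar answers for U+0000..U+00FF under the old rules
# ('-', '.', digits, ':', letters, '_', the middle dot and Latin-1 letters).
LATIN1_NAME_CHAR = [False] * 256
for lo, hi in [(0x2D, 0x2E), (0x30, 0x3A), (0x41, 0x5A), (0x5F, 0x5F),
               (0x61, 0x7A), (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6),
               (0xF8, 0xFF)]:
    for cp in range(lo, hi + 1):
        LATIN1_NAME_CHAR[cp] = True

# 2^9 table: one boolean per 128-code-point chunk of the BMP.
# Illustrative denials: General Punctuation (the whole chunk U+2000..U+207F
# goes, which is what makes the filter coarse) and the surrogate and
# private-use areas U+D800..U+F8FF.
CHUNK_DENIED = [False] * 512
for lo, hi in [(0x2000, 0x206F), (0xD800, 0xF8FF)]:
    for chunk in range(lo >> 7, (hi >> 7) + 1):
        CHUNK_DENIED[chunk] = True


def coarse_name_char_ok(cp: int) -> bool:
    """Coarse stand-in for the NameChar check: two table lookups at most."""
    if cp < 0x100:
        return LATIN1_NAME_CHAR[cp]
    if cp < 0x10000:
        return not CHUNK_DENIED[cp >> 7]
    return True  # assumption: this sketch lets the supplementary planes through


def coarse_name_ok(name: str) -> bool:
    """Apply the coarse filter to every character of a candidate name."""
    return bool(name) and all(coarse_name_char_ok(ord(ch)) for ch in name)
```

The filter deliberately does not distinguish name-start characters from name characters, and it over-rejects or over-accepts at chunk boundaries. That is the trade-off a tightly-coupled application would be choosing.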

But the formal WF rules would align with the Unicode recommendations on character properties: allowing whatever the libraries of the platform or application allowed. The characters that are on the bleeding edge will always have deployment problems.

Doesn't this water down WF just as much as the fifth edition does? Well, consider the case of a pipeline of processes that pass XML between them and that pass a DOM between them: if they pass the DOM, there is no checking at all. It seems reasonable to me to have lesser checking for tightly-coupled applications, but to have more stringent firewalling for loosely-coupled processes and public data: stranger danger.

Finally, there is another mental somersault on display. There is this idea that XML 1.1 failed because it did more than the minimum to declare victory. That is complete rubbish. It "failed" because existing parsers would not support the documents. But that is exactly the same problem that faces XML 1.0 (fifth edition) documents that actually use these changes! Sorry, it just does not compute.


5 Comments

Isn't the whole point of establishing standards to give us all guidelines to aspire to, a 'safety point' that we can all agree will work among the widest set? It's disappointing that it seems the whole notion of a standard is going to be violated with this. Thanks for highlighting this issue.

The change does not affect existing XML documents: all existing well-formed documents continue to be well-formed in 5e.

Fifth edition does in fact introduce the idea of major/minor version numbers, albeit rather quietly ad

Moving to Unicode 5 and later isn't about political correctness. Yes, users of the Japanese keyboard will be affected too: over 64,000 Kanji characters were added, as well as the word-separator blob that's on the keyboard. And of course many other people.

XML 1.1 wasn't being used because people didn't implement it. Those same people were not about to implement an XML 1.2 (we asked). So this is a compromise, to get support from implementors. Albeit an ugly hack.

Encouraging people to use a "broader filter" and not do well-formedness checking on character sets would seem a much worse approach to me, and would lead to much poorer interoperability. And basing XML on the Unicode blocks has the problem that Unicode continues to add blocks.

You are right, and we all agree, I think, that standing still wasn't an option. We had to act. I went to the XML Core Working Group and proposed action in this area. And for my part I think the XML Core Working Group acted as wisely and carefully as anyone could have done, and have produced a specification that is as compatible as possible whilst still addressing the needs of today. And a lot more likely to get adoption than the XML 1.2 solution I had proposed, by the way.

We'll see who ends up supporting it. Already there seems to be more support for 5th edition than for 1.1, though.

Liam

Excellent observations. Do version numbers even mean anything anymore when you get to the fifth edition of it?

Thanks for writing this!!

Liam: I would be happier if the spec came with a big red note like

"WARNING: It may take six years before software using the current generation of XML 1.0 processors is obsolescent and replaced, therefore document owners needing to continue their existing levels of interoperability are advised to use XML 1.0 (4th edition) until then."

That puts it out in the open. All the XML 1.0 (fifth edition) does is remove the ability for a processor to clearly report why a document fails. Instead of saying "I don't support XML 1.1", the processor now gives the user the message "Your document is not XML 1.0". How is that actually helpful to anyone?

You cannot expect adoption until there is a generation change. And change cannot happen unless Microsoft, Java, Firefox and libxml are on-side. XML 1.1 had neither the time nor the buy-in (MS were the villain IIRC) and those issues apply just as much to XML 1.0 fifth edition.

B.t.w. I don't know how 64,000 kanji can have been added; I thought Unicode 5.0 only had 40,000 ideographs in Plane 2. In fact, there is a strong argument to be made that the Plane 2 Ideographic characters are positively bad for use as element names. All the CJK countries have various national lists of characters that are allowed/suitable for certain uses: children's books, newspapers, names, school material and so on.

These rare, antique or specialist characters may be obscure, but they are not just obscure like "ammonal" is obscure. They are also obscure in the sense that the user may have to arrange special fonts in order to read the markup, and special dictionaries in order to be able to enter that character in the name (NCRs not being available, rightly). And they will have to have systems that support surrogates anyway (for the UTF-16 using platforms like Java and .NET).

By allowing the Plane 2 characters, the XML WG has not said "This is not our business to understand the implications of these characters" as it has tried to. If it wanted to do that, it would say "We will follow Unicode, and liaise about our needs and concerns with them." What XML WG have done is said "We think these characters are not problematic and fine to use, no matter what the Unicode Consortium or others say."

Rick, thanks for replying. We can't predict the future, and your six years could be sixty, or it could be six months.

There's no way to make changes without some compatibility problems. That's not a reason to stay still. It means one has to weigh the needs against the problems. My regret here is that XML 5e wasn't XML 2e. We got XML 1.1 badly wrong, and we misjudged people's willingness to go with a new version of XML, so this is another approach.

The World Wide Web is for everyone, and XML is an important part of that; we stand by that, and cannot accept anything less... international support, like accessibility, is a must, not a maybe. Being stuck on Unicode 2.1 isn't an option, and I make no apology for moving forward. You're right that the best way forward is unclear. Let's see if we can make 5e work.

Best,

Liam
