James Clark's blog post XML v the Web, responding to Norm Walsh's Deprecating XML, has stimulated much discussion over at the XML-DEV geekfest maillist. The basic question is whether it is good, bad or indifferent that JSON is taking over several kinds of data transfer that XML had been used for, and in particular whether JSON shows up XML 1.0's complexity: is it time to overhaul XML?
The last time this was seriously discussed was 11 years ago, and I think some of the same arguments are floating about again. I actually implemented in some products a "simplified" dialect of XML called Extended Concrete Syntax (ECS) about 9 years ago, but I am not a minimalist.*
Anyway, rather than dissecting corpses of old arguments, I thought I'd figure out what I'd like to see in a re-developed XML. **
So here is my armchair redesign: a new XML, which I call Nuke!
Nuke is a mix of XML and JSON, with several new ideas thrown in. It allows better streaming, smaller parse tables, terser markup, and gives the document creator a richer and freer set of tag types than XML or JSON.
A Nuke file makes a single rooted document. The root element has the name "" (empty string) and is implied for every document. (It takes the role of the root node in XPath.) So a Nuke file may contain multiple "top-level" elements under this implied root. This makes it suitable for infinite streams and log files created by appending a series of top-level elements.
A streaming API for Nuke streams may present the document progressively as an iterator over the top-level blocks. This allows web use, where it is necessary to render information as soon as it arrives, rather than waiting for all the data.
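A toy sketch of how such an iterator might be layered over a character stream. This naive splitter only tracks start/end-tag depth and assumes input consisting of elements (comments, CDATA and quoted attribute values would confuse it); the function name is hypothetical:

```python
import re

def top_level_branches(text):
    """Yield each top-level branch of the implied root as soon as it completes.

    Toy implementation: depth-counts start tags, end tags and
    empty-element tags; does not handle free tags or quoting.
    """
    depth = 0
    start = 0
    for m in re.finditer(r'<(/?)[^<>]*?(/?)>', text):
        is_end, is_empty = m.group(1), m.group(2)
        if is_end:                       # end tag: close one level
            depth -= 1
            if depth == 0:
                yield text[start:m.end()].strip()
        elif is_empty:                   # <x/> at top level is a whole branch
            if depth == 0:
                yield m.group(0)
        else:                            # start tag: open one level
            if depth == 0:
                start = m.start()
            depth += 1
```

A streaming version would yield from a growing buffer as chunks arrive, which is what makes the append-only log-file use case work.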
Well-formedness is scoped per "top-level" element. A well-formedness error in one such branch makes that branch in error, but does not affect the status of subsequent or prior branches.
Nuke is thus less draconian than XML well-formedness, but still strict enough to reject bad branches. The creator of the document can decide what granularity of Draconian behaviour to apply by choosing to use a single top-level element (XML style) or to use multiple independent top-level elements.
A Nuke document is UTF-8 in its external form. Inside a program, after parsing, it would typically use UTF-16.
A non-UTF-8 byte sequence (in particular, a single byte in the range 0xA0 to 0xFF) should be treated as the corresponding Latin-1 character on reading. This allows files trivially corrupted by stray ISO 8859-1 characters to be handled. This is a recoverable error.
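A sketch of this recovery rule using a custom decode-error handler (the handler name is made up; a real parser would also report the recoverable error):

```python
import codecs

def _latin1_fallback(err):
    """Decode any byte run that is not valid UTF-8 as Latin-1 instead."""
    bad = err.object[err.start:err.end]
    return (bad.decode('latin-1'), err.end)

# Register under an illustrative name; any unused name would do.
codecs.register_error('nuke-latin1', _latin1_fallback)

def decode_nuke(data: bytes) -> str:
    """UTF-8 with per-byte Latin-1 fallback, per the recovery rule above."""
    return data.decode('utf-8', errors='nuke-latin1')
```

Valid UTF-8 passes through untouched; a stray 0xE9 byte comes out as "é" rather than aborting the parse.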
Apart from whitespace characters, no characters from the C0 or C1 code ranges may be used. Such characters may be added only by numeric character references. This provides the encoding redundancy necessary to detect many simple encoding errors.
Character references may use XML hex numeric character references (not decimal). All W3C entity sets (the ISO sets augmented by MathML) are predefined.
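A minimal sketch of reference resolution, with only a tiny sample of the predefined entity names (the real W3C/MathML sets contain thousands):

```python
import re

# Tiny illustrative subset of the predefined entity sets.
ENTITIES = {'amp': '&', 'lt': '<', 'gt': '>', 'euro': '\u20ac'}

def resolve_refs(text: str) -> str:
    """Expand hex numeric references (&#x...;) and known named entities.

    Toy: an unknown entity name raises KeyError rather than reporting
    a proper error.
    """
    def repl(m):
        if m.group(1):                      # &#xHH...; hex reference
            return chr(int(m.group(1), 16))
        return ENTITIES[m.group(2)]         # named entity
    return re.sub(r'&#x([0-9A-Fa-f]+);|&(\w+);', repl, text)
```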
(No Unicode character normalization form is assumed. Document producers should adopt the appropriate one voluntarily.)
Schemas, stylesheets and linkbases
There are no DOCTYPE declarations. Schemas can be added using the W3C processing instruction. Stylesheets can be added using the W3C processing instruction. Linkbases can be added using ...
Element and Attribute Structure
Most simple XML documents (with no DOCTYPE declarations) are Nuke documents.
All JSON documents are Nuke documents. The JSON data is added as structured attributes of the implied anonymous root element.
XML's attribute syntax is extended to also allow JSON data in a start tag after any XML attributes. A streaming API should present the XML attributes and JSON structured attributes in a common way, and make them available as part of the same event that processes the start tag. The designer of the document can customize data availability in streaming use by choosing to use attributes or elements. (A JSON field name cannot be the same as a defined prefix used for namespaces.)
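For illustration, a start tag might carry both kinds (the names are invented, and the colon syntax mirrors the example document later in the post):

```
<purchase-order id="po-1" currency:"EUR" lines:[ 1 2 3 "4" ]>
  <amount>42.56</amount>
</purchase-order>
```

Here id is an ordinary XML attribute, while currency and lines are JSON structured attributes delivered in the same start-tag event.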
Namespaces follow XML Namespaces, with the exception that no default namespaces are allowed on elements. An element or attribute name which is in a namespace must have a prefix. This simplifies the rules compared to XML and prevents confusion and some kinds of rebasing issues. It is an error for a prefix to be re-bound to a different URI in the same top-level branch.
Tokens and parsing
Apart from the characters in the first 512 characters of Unicode, all parsing is based on character class, determined at block level, or from the Unicode symmetrical swap property (eg "[" is the symmetrical swap of "]".) Therefore, only two 512-entry tables plus a swap list are required to determine the class of any character in the Unicode BMP. Characters outside the Unicode BMP may be used for data but not for markup purposes (outside literals.)
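A toy illustration of the table-driven classification. The class assignments here are invented for the example, not taken from any real table:

```python
# Toy character classes for the low table; a real implementation would
# fill two 512-entry tables from the spec, plus a generated swap list.
MARKUP = {'<', '>', '/', '!', '?', '=', '"', "'"}
NAME_START = set('abcdefghijklmnopqrstuvwxyz'
                 'ABCDEFGHIJKLMNOPQRSTUVWXYZ_')

_low_table = ['data'] * 512
for c in MARKUP:
    _low_table[ord(c)] = 'markup'
for c in NAME_START:
    _low_table[ord(c)] = 'name'

def char_class(ch: str) -> str:
    """Classify a character: table lookup below 512, block-level above."""
    cp = ord(ch)
    if cp < 512:
        return _low_table[cp]
    if cp > 0xFFFF:
        return 'data-only'   # outside the BMP: usable as data, never markup
    return 'data'            # toy: real classes come from the block table
```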
Content = Data + Markup
Markup is any sequence that follows the following production:
markup ::= "<" ("/" | "!")? (( tag-type markup-contents tag-type ) | ( markup-contents ) "/"?)? ">"
tag-type ::= \p(symbol)+
markup-contents ::= name s XML-attributes* JSON-attributes
name ::= XML-name | delimited-string
(Note that for the tag-type, the first occurrence and the second should match, with symmetrical-swap characters swapped.)
So the following XML-ish tags are allowed:
<x a="b"> </x> <!-- x y z --> <?x y z?> <![CDATA[ xxx ]]>
However, in the case of comments, the comment ends with "-->" rather than by XML's "--" rule. With the CDATA section, the tag contains CDATA[ xxx ]. A tag may not contain a literal < character; therefore, even though a tag that looks like the XML CDATA section is allowed, it would be a markup error for it to contain something that looked like a subelement, such as <![CDATA[ xxx<x/>xxx ]]>
This free syntax allows the user to extend the kinds of annotative tags arbitrarily. For example, the following is allowed:
<x> <@@@ some particular information @@@> </x>
This is a tag whose type-name is "@@@".
Symmetrical swapping allows the following kinds of tagging:
<x> <!--[some particular information ]--> </x>
This is a tag whose type-name is "!--[".
The following type-names are pre-defined: "--" is a comment, "?" is a PI, "[" should not be used. Comments, PIs, etc are subsumed under the more generic category of "free tags".
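The matching rule for free-tag type-names can be sketched as reverse-plus-mirror. The swap list here is a small hand-picked sample; a real implementation would generate it from Unicode mirroring data:

```python
# Toy sample of the symmetrical-swap list.
SWAPS = {'[': ']', ']': '[', '(': ')', ')': '(', '{': '}', '}': '{'}

def mirror(ch: str) -> str:
    """Return the symmetrical-swap partner of a character, or itself."""
    return SWAPS.get(ch, ch)

def closing_form(tag_type: str) -> str:
    """Closing form of a tag-type: reverse it and mirror each character.

    E.g. a free tag opened with "--[" closes with "]--".
    """
    return ''.join(mirror(c) for c in reversed(tag_type))
```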
Based on the lexical rules above, the parser can create a tree of elements, as in XML. Every start tag should have an explicit matching end tag; the short form </> is allowed for the end tag, and the empty-element form (e.g. <x/>) is also allowed. A document creator may choose among these according to their own trade-offs.
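A toy nesting checker that honours the </> short form might look like this (elements only; free tags, attributes and quoting are ignored):

```python
import re

def check_nesting(text: str) -> bool:
    """True if start and end tags nest properly.

    </name> must match the innermost open element; the short form </>
    closes the innermost open element unconditionally; <x/> is
    self-contained.
    """
    stack = []
    for m in re.finditer(r'<(/?)([\w-]*)\s*(/?)>', text):
        is_end, name, is_empty = m.groups()
        if is_empty:                  # <x/> opens and closes itself
            continue
        if is_end:
            if not stack:
                return False          # end tag with nothing open
            top = stack.pop()
            if name and name != top:  # </> matches anything; </x> must match
                return False
        else:
            stack.append(name)
    return not stack                  # everything opened was closed
```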
The elements have attributes like XML but also may have structured attributes like JSON. The XML attribute syntax is extended to allow element and attribute names in string delimiters (therefore allowing character references or perhaps whitespace in names), and attribute values may also be the JSON unquoted tokens (numbers, booleans), i.e. duck typing.
Mixed content is allowed.
The parser of any terminal client application, such as a browser, should remove any tag starting with <! (such as comments or left-over simple doctype declarations) from the information passed to the client.
The following is a single Nuke document.
<!-- This looks like XML ? -->
<purchase-order> <amount>€42.56</amount> </purchase-order>
<!-- The following is a second top-level branch with quoted names and short tags.-->
<total>€ <=total()=> </>
<!-- The following has some JSON and some other element-->
<purchase-order name:[ 1 2 3 "4"]>
<x /><y />
<=== insert file.xml ===>
</>
Reserved attributes
Attributes with the "nuke:" prefix are reserved.
- nuke:ignore signals that the element is incomplete in some way, and that the application should not process it normally. This functions like "commenting out", but at the element level, where the information is still present to the application.
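For example, a branch still under construction might be marked like this (the attribute value syntax is illustrative):

```
<purchase-order nuke:ignore="true">
  <amount>TBD</amount>
</purchase-order>
```

An application would still see the element and its content, but would know not to process it as a live purchase order.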
A Nuke archive is a simple ZIP archive, with no patented features allowed and file names interpreted as UTF-8. The root of this archive can contain one Nuke file, eg xxx.nuke. Specific language versions of this file may be found at the same level using a language suffix, eg xxx-fr.nuke.
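A sketch of reading and writing such an archive with Python's standard zipfile module (the function names and archive layout details are illustrative):

```python
import zipfile

def write_archive(path, name, content, translations=None):
    """Write a Nuke archive: name.nuke at the root, plus name-LANG.nuke
    variants. ZIP member names are stored as UTF-8."""
    with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{name}.nuke', content)
        for lang, text in (translations or {}).items():
            zf.writestr(f'{name}-{lang}.nuke', text)

def read_archive(path, name, lang=None):
    """Read the main file, or a language variant if lang is given."""
    with zipfile.ZipFile(path) as zf:
        member = f'{name}-{lang}.nuke' if lang else f'{name}.nuke'
        return zf.read(member).decode('utf-8')
```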
* There are some fairly wonky ideas floating about: for example, that JSON doesn't have comments (ie the programming language feature) and it is successful, therefore XML doesn't need them. I would be surprised if any of the same programmers would be as happy to use programming languages with no comments. JSON's lack of comments means its use is restricted to dynamically generated data that disappears on use, as far as I can see: this is fine for JSON but not appropriate for data being maintained over months or years, I'd suggest. JSON is actually more complex (rich) than raw XML in terms of its datatypes, and it has a tremendous convenience factor compared to XML+Schemas.
** For example, the XML design goal "terseness is of minimal importance" was important for getting rid of many kinds of SGML markup minimization in XML, but one of the problems people identify compared to JSON is verbosity. It looks like XML went too far: the document creator can decide whether they need the redundancy check and visual aid that explicit named end tags provide. Nothing is gained by not allowing short end tags like </> as well as full tags, as far as I can see.