Defining markup languages using Unicode properties

By Rick Jelliffe
December 16, 2008

Because of XML's well-formedness rules, we can view an XML document's information set as a nice directed graph structure if we want: a tree decorated with attributes, some of which are references to other elements. (Indeed, nowadays we are probably better off viewing it as the nodes reachable from one root of a multiple-rooted directed graph, if we allow inter-document links. But I digress.)

In SGML, XML's big older brother, it is possible to redefine the delimiter set. It is also possible to declare that certain characters are, in a particular context, shortcuts for some other tag or data (SHORTREF). This allows wiki-like markup. SGML also has an odd feature, only supported by specialist tools, called CONCUR, which makes it possible to prefix the generic identifiers on elements with another name: unlike namespace prefixes, only one of these names is active at any time. This allows a kind of concurrent markup: for example, one set of tags which describes the paragraphs of a document, and another set which describes the lines, blocks and pages of the document, where the two kinds of tags need not nest into each other.

Over the last few years, the W3C XML Working Group charged with maintaining XML has been trying to cope with changes to Unicode. The results are XML 1.1 and the feckless XML 1.0 (fifth edition), which has outraged people like Elliotte Rusty Harold and James Clark. [UPDATE: Tim Bray, David Carlisle too.] The problem is how to handle unallocated characters in Unicode that get allocated and may be useful for element names. The existing XML way, which has undoubtedly worked well, is to list the allowed characters; the XML WG is therefore responsible for updating the list. XML 1.1 took the approach of merely listing the characters that were excluded from names, leaving the unallocated characters free to use. XML 1.0 (fifth edition) adopts this approach: I am not sure why you would want to replace a method that has proved itself satisfactory with a method that has proved itself unpopular, but that is not the point of this blog.

There is a third way of deciding which characters are allowed in names: use the Unicode properties. Unicode defines a large number of properties for each character, for example whether it is a letter, digit or symbol. What about when a document requires a more modern version of Unicode than the parser supports? Well, parsers normally either change faster than the OS or use the OS features. So if a document has name characters that an OS does not recognize, chances are the OS will not do the right thing with the document anyway. That is just the nature of being at the leading edge.
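To make the property-based approach concrete, here is a minimal Python sketch using the standard `unicodedata` module. The choice of which general categories count as name-start and name characters is my own assumption for illustration; neither this post nor any XML edition fixes it this way.

```python
import unicodedata

# Assumption: these category sets are illustrative only, not taken from any spec.
NAME_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
NAME_CATEGORIES = NAME_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def is_name(s: str) -> bool:
    """Decide whether s is a name by looking only at Unicode general categories."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in NAME_START_CATEGORIES:
        return False
    return all(unicodedata.category(c) in NAME_CATEGORIES for c in s[1:])

# The answer depends on the Unicode version the runtime ships with.
print(unicodedata.unidata_version, is_name("para"), is_name("1abc"))
```

A character that is unallocated in the runtime's version of Unicode is reported as category `Cn` and so is rejected here; upgrading the runtime, not the markup specification, is what changes the answer.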

So this raises an interesting prospect: can we define a family of markup languages that use the Unicode properties, and which could accept a fair imitation of XML and produce a SAX-like event stream?

Let's start off by saying there are three kinds of text in a document:

     text = data + token-tags + complex-tags.  

A token tag is defined as a sequence that

  1. Starts with a symbol (that is not a paired symbol)

  2. Has any arbitrary number of (non-paired) symbols or punctuation after it

  3. Has any number of letters, ideographs, digits or punctuation (excluding ;), which we call a token

  4. Ends with a (non-paired) symbol (including ;) or whitespace or other char.

An example of this is &#123; so token-tags cover XML entity and character references, but also many other things. For example @xxx|
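The four rules can be sketched as a small matcher. Several readings here are my assumptions rather than part of the definition: "symbol" is taken loosely as any character in a Unicode S* or P* category that is neither a bracket or quote category nor bidi-mirrored; the token is narrowed to letters, marks, digits and connector punctuation so that it cannot swallow its own terminator; and a terminating delimiter is consumed while whitespace is not. The name `match_token_tag` is hypothetical.

```python
import unicodedata

PAIRED_CATEGORIES = {"Ps", "Pe", "Pi", "Pf"}

def is_delim(ch):
    """Non-paired symbol or punctuation (a loose reading of 'symbol')."""
    cat = unicodedata.category(ch)
    if cat in PAIRED_CATEGORIES or cat == "Pc" or unicodedata.mirrored(ch):
        return False
    return cat[0] in "SP"

def is_token_char(ch):
    """Assumption: tokens are letters, marks, digits and connector punctuation."""
    cat = unicodedata.category(ch)
    return cat[0] in "LMN" or cat == "Pc"

def match_token_tag(text, start=0):
    """Return (prefix, token, terminator, next_index), or None if no token-tag starts here."""
    i, n = start, len(text)
    while i < n and is_delim(text[i]):        # rules 1 and 2: leading symbols
        i += 1
    if i == start:
        return None
    prefix_end = i
    while i < n and is_token_char(text[i]):   # rule 3: the token
        i += 1
    if i == prefix_end:
        return None
    token_end = i
    terminator = ""
    if i < n and is_delim(text[i]):           # rule 4: consume a closing symbol if present
        terminator = text[i]
        i += 1
    return text[start:prefix_end], text[prefix_end:token_end], terminator, i

print(match_token_tag("&#123; rest"))
print(match_token_tag("@xxx|"))
```

Under these assumptions both a character reference and `@xxx|` come out as the same kind of event, differing only in their delimiters.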

By a paired symbol, I mean a character with one of the Unicode properties PUNCTUATION OPEN, PUNCTUATION CLOSE, PUNCTUATION INITIAL QUOTE, PUNCTUATION FINAL QUOTE or (perhaps) BIDI-MIRRORED. A non-paired symbol is one without any of these properties.

The paired symbols are used as the delimiters for the complex tags.
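The "(perhaps)" matters more than it looks. A quick check with Python's `unicodedata` module (the function name `is_paired` and its flag are my own, for illustration) shows that `<` is classified as a math symbol, not as opening punctuation, so it only counts as paired if the bidi-mirrored property is included:

```python
import unicodedata

# General categories: Ps = open, Pe = close, Pi = initial quote, Pf = final quote.
PAIRED_CATEGORIES = {"Ps", "Pe", "Pi", "Pf"}

def is_paired(ch, include_mirrored=False):
    """True if ch is a paired symbol, optionally counting bidi-mirrored characters."""
    if unicodedata.category(ch) in PAIRED_CATEGORIES:
        return True
    return include_mirrored and unicodedata.mirrored(ch) == 1

for ch in "(<&":
    print(ch, unicodedata.category(ch), is_paired(ch), is_paired(ch, include_mirrored=True))
```

So a family of languages that wants XML-style angle brackets as complex-tag delimiters would need the bidi-mirrored reading, or an explicit addition to the delimiter set.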

A complex tag has the syntax:

  1. Starts with a PUNCTUATION OPEN (or BIDI-MIRRORED character?)

  2. Optionally followed by any number of non-paired symbols

  3. Contains runs of tokens, whitespace, non-paired delimiters, token-tags and nested complex tags.

  4. Ends with the corresponding PUNCTUATION CLOSE character, optionally preceded by any number of non-paired symbols.

So examples of this kind of tag are

<!-- -->
<something:somethingelse detail="XXX &amp; YYY" />

but also

struct X { integer: X; integer y}
(eval(unquote(car 'X)))
<para>It was a <$/page$><$page$>dark and stormy night</para>
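Finding where a complex tag ends amounts to bracket matching with nesting. The sketch below is a simplification under stated assumptions: the pair table is a small hand-written ASCII one (Python's `unicodedata` does not expose which closing character corresponds to which opening one), `<` and `>` are treated as paired, and quotation marks are not treated as delimiters, so a `>` inside an attribute value would end the tag early. The name `scan_complex_tag` is hypothetical.

```python
# Assumption: a tiny hand-written pair table stands in for real Unicode bracket data.
PAIRS = {"<": ">", "(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def scan_complex_tag(text, start=0):
    """Return the index just past the complex tag starting at start, or None."""
    if start >= len(text) or text[start] not in PAIRS:
        return None
    stack = []
    for i in range(start, len(text)):
        ch = text[i]
        if ch in PAIRS:
            stack.append(PAIRS[ch])   # remember which closer we now expect
        elif ch in CLOSERS:
            if ch != stack[-1]:
                return None           # mismatched closer: not a well-nested tag
            stack.pop()
            if not stack:
                return i + 1          # the outermost opener has been closed
    return None                       # ran out of input with brackets still open

for s in ["<!-- -->", "(eval(unquote(car 'X)))", "<$/page$><$page$>"]:
    print(repr(s[:scan_complex_tag(s)]))
```

Note that in the last example only the first `<$/page$>` is returned: each complex tag is a separate candidate, which is what lets the page tags cut across the paragraph tags.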

So parsing a document involves first lexing it using these rules to find every occurrence of a potential token-tag or complex-tag, even if they overlap. These are presented as a stream of potential events in a SAX-like stream. Each event is accepted or rejected by the application. If an event is rejected, then the first delimiter is re-presented as a data character, and scanning resumes at the subsequent character. (This approach is also amenable to the lookahead pipelined block parsing that Prof. Cameron has been working on at Simon Fraser U., where in effect you first scan for potential delimiters without requiring any pipeline-breaking decisions, then parse those points.)
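The accept/reject loop can be sketched as follows. To keep the example short, a deliberately tiny regular expression stands in for the Unicode-property rules, and overlapping candidates are not reported; both are simplifications of mine, as are the names `lex` and `accept`.

```python
import re

# Assumption: a toy candidate matcher, far narrower than the rules in the post.
CANDIDATE = re.compile(r"&#?\w+;|<[^<>]*>")

def lex(text, accept):
    """Offer each candidate tag to accept(); rejected candidates fall back to data."""
    events, data, i = [], [], 0

    def flush():
        if data:
            events.append(("data", "".join(data)))
            data.clear()

    while i < len(text):
        m = CANDIDATE.match(text, i)
        if m and accept(m.group()):
            flush()
            events.append(("tag", m.group()))
            i = m.end()
        else:
            # Rejected, or no candidate here: the current character becomes
            # data and scanning resumes at the next character.
            data.append(text[i])
            i += 1
    flush()
    return events

print(lex("a &amp; b <x> &nope;", accept=lambda tag: tag != "&nope;"))
```

Here the application rejects `&nope;`, so its `&` is re-presented as data and the rest of it is scanned as ordinary text.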

This is, of course, a much lower level of tokenization than XML provides. But XML could be built on top of it, as could many other languages, including overlapping markup syntaxes. The SGML features could be reformulated in terms of it, as could combined tokenizers (e.g. XML + CSS + JavaScript, though I have not checked the details).

Of course, a restricted version of this would be to say that the only initial symbol for token tags allowed was &, and that the initial delimiter for outer complex-tags was <, with the only initial delimiter for contained complex tags being " or ', in turn recognizing inside these contained complex tags only the outer complex-tag delimiter, and so on. This would give a kind of expanded XML, which allows tags like

<?thing deed?>
<!-- comment -->
<$ some new kind of tag $>
<% some other kind of tag %>
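In this restricted form, the kind of a tag could be read straight off the runs of non-paired symbols just inside the angle brackets. A minimal sketch, assuming "symbol" means any S* or P* character that is neither a bracket or quote category nor bidi-mirrored (the name `tag_kind` is hypothetical):

```python
import unicodedata

def is_symbol(ch):
    """Non-paired symbol or punctuation character."""
    cat = unicodedata.category(ch)
    if cat in {"Ps", "Pe", "Pi", "Pf"} or unicodedata.mirrored(ch):
        return False
    return cat[0] in "SP"

def tag_kind(tag):
    """Return (opening symbols, closing symbols) for a <...> tag, or None."""
    if len(tag) < 2 or tag[0] != "<" or tag[-1] != ">":
        return None
    body = tag[1:-1]
    i = 0
    while i < len(body) and is_symbol(body[i]):
        i += 1
    j = len(body)
    while j > i and is_symbol(body[j - 1]):
        j -= 1
    return body[:i], body[j:]

for t in ["<?thing deed?>", "<!-- comment -->", "<$ some new kind of tag $>", "</para>"]:
    print(t, tag_kind(t))
```

An application could then dispatch on the pair, treating `("?", "?")` as a processing instruction, `("!--", "--")` as a comment, and leaving pairs such as `("$", "$")` free for new uses.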

I tend to think this would be a good thing. I know that many people like to say that XML has too many different kinds of markup, but I think it has too few, certainly too few to express the different kinds of datatypes: you cannot even say something is a number or a string! The tag space of other symbols after a < is currently free but not WF. It doesn't have to be that way.

Of course, while these things are possible, it may be an unnecessary adventure.
