A reader asked me about some recent vague press items about newly discovered security flaws in some XML parsers. Topologi makes a kind of firewall validator for XML called Interceptor, and since security is one of the applications of validation it is an area I need to be more aware of.
A good starting point is to search for the keyword 'XML' in the US NIST National Vulnerability Database. Other material on the web is self-serving promotion or too vague to be of any use; indeed, some material seems to misreport the ramifications, or perhaps plays fast and loose in the interests of a more sensational story. But even then you need to dig around a bit...
The recent Denial of Service attack vulnerabilities relating to XML seem to be:
- Some XML parsers do not detect recursive loops in parameter entities, and hang or loop. Recursive entity references are banned in XML (and explicitly mentioned in RFC 2376), so this seems a good thing to fix.
- Some XML parsers or systems will crash or hang if you send a document that is too big. Again, RFC 2376 mentions the finite resources of computers.
- Some XML parsers or systems are not written to properly yield to other processes when handling certain data (in particular, long runs of consecutive start-tags).
- Some systems are not written to cope with a non-WF or invalid document. This is of course a programming problem, not ultimately an XML one. (It is interesting that there is a class of exploits where the intent is to glean sensitive information from diagnostic or correction responses.)
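Most of these can be closed off at the parser level, before any application code runs. Here is a minimal sketch, assuming a JAXP SAX parser that understands the Xerces `disallow-doctype-decl` feature (the class name `SafeParse` is invented for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SafeParse {
    // Returns true if the document parsed cleanly, false on any error.
    public static boolean parseSafely(String xml) {
        try {
            SAXParserFactory f = SAXParserFactory.newInstance();
            // Refuse any document carrying a DOCTYPE, which closes off
            // recursive-entity and entity-expansion attacks entirely.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser p = f.newSAXParser();
            p.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Non-well-formed or DTD-bearing input: fail safely,
            // and leak no diagnostic detail back to the sender.
            return false;
        }
    }
}
```

Rejecting DTDs outright is blunt (and, as noted in the checklist below, not always appropriate), but where it is acceptable it removes the whole class of entity-based attacks in one move.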
Is a large document really a potential Denial of Service attack? Well, yes, but it is obviously not an issue unique to XML. I think Java has a particular weakness (or strength) here, because of its strange heap allocation rules. The JVM needs to be tuned to cope with the maximum expected size of incoming documents, and if the system is in an exposed-to-attack position, it may be well to run the servlet in a separate JVM so that a DoS attack succeeds only against that servlet.
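As a rough illustration of that tuning point, a front-end can simply refuse any document whose declared size exceeds some fraction of the maximum heap. The class name and the one-quarter ratio below are invented for illustration, not recommendations:

```java
public class SizeGate {
    // Reject any document larger than a quarter of the maximum heap.
    // The 1/4 ratio is an arbitrary illustrative threshold.
    static final long LIMIT = Runtime.getRuntime().maxMemory() / 4;

    public static boolean acceptSize(long declaredBytes) {
        // Zero or negative sizes are refused too: an empty or
        // unknown-length document deserves its own handling path.
        return declaredBytes > 0 && declaredBytes <= LIMIT;
    }
}
```

Content-Length headers can lie, of course, so a check like this only complements a hard cap enforced while actually reading the stream.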
I found Java quite frustrating in this regard (it may have improved recently): when you get an OutOfMemoryError, it is difficult to do anything useful with it. For example, should we be writing our code with a SAX handler that can abort a read if some global flag (or some listener system, if you think you can trust that) was set by the out-of-memory handler?
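One possible shape for such a handler, sketched with an invented volatile `abortRequested` flag that a low-memory watchdog would set:

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// A handler that polls a shared flag and aborts the parse by throwing.
// The flag is illustrative; any volatile boolean set from an
// OutOfMemoryError recovery path would serve the same purpose.
public class AbortableHandler extends DefaultHandler {
    public static volatile boolean abortRequested = false;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (abortRequested) {
            // Throwing SAXException is the only portable way to stop
            // a SAX parser in mid-stream.
            throw new SAXException("parse aborted: low memory");
        }
        // ... normal element handling would go here ...
    }
}
```

The cost is a flag check on every start-tag, which is cheap next to the parsing itself; the benefit is that a runaway document can be cut off without waiting for the whole read to fail.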
One technique is John Boyland's Hedge, where you pre-allocate a chunk of memory for a rainy day and then release it on an out-of-memory exception: this can reduce thrashing and buy a little extra memory, at best to tide you over the problem and at worst to fail or recover safely. Another approach for Java is to use NIO external buffers as much as possible, so that large incoming files do not use up heap space in the JVM. And another technique, so often ignored by Java programmers: whenever you hold the initial reference to an object (or to a graph of objects) that has the potential to get large, explicitly null the reference as soon as the object is no longer needed, rather than leaving it to Java's automatic garbage collection. An object stays reachable for the duration of its block scope, so it may be unrecoverable for many lines of code.
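The hedge idea can be sketched in a few lines; the class name and the 4 MB reserve size below are illustrative assumptions:

```java
// A minimal sketch of the "hedge": hold a reserve buffer, and drop it
// when an OutOfMemoryError arrives so there is headroom to fail cleanly.
public class MemoryHedge {
    private static byte[] reserve = new byte[4 * 1024 * 1024]; // 4 MB rainy-day fund

    // Returns true if the reserve was held and has now been released.
    public static synchronized boolean releaseReserve() {
        if (reserve == null) return false; // already spent
        reserve = null;   // make the block collectable immediately
        System.gc();      // a hint only; the JVM is free to ignore it
        return true;
    }
}
```

A catch block for OutOfMemoryError would call `releaseReserve()` and then use the freed headroom to log, abort the offending parse, or shut down in an orderly way.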
The issue with the start-tags is interesting. I think most people writing serious desktop applications in Java come across the problem sooner or later that so many Java libraries are written for server applications, and can hog resources. It is not limited to Java, as users of Firefox (and its occasional hangs relating to blocking IO) can attest. The vulnerability was found in IE 6:
Microsoft Internet Explorer 6 through 6.0.2900.2180 and 7 through 7.0.6000.16473 allows remote attackers to cause a denial of service (CPU consumption) via an XML document composed of a long series of start-tags with no corresponding end-tags.
Here is some example code (Python) for generating such a document, which caused some kind of problem with Opera (now fixed):
exploit = '<A>' * 7400
exploit = '<xml>' + exploit + '</xml>'
Every few months I send around an email to the programmers at work, reminding them to make their Java libraries "good citizens". In particular, this means knowing where in their code there could be long processing runs that could hog resources: putting an explicit Thread.yield() every 1000 iterations into some CPU-intensive loops, for example, can make GUI responses much smoother, unblock other threads, and even help with garbage collection. And putting actual small delays (not CPU-wasting dummy loops, of course!) into code that accesses shared, congestible resources such as disks and file systems prevents your process from acting, in effect, like a virus.
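The yield-every-N-iterations advice looks like this in practice; the loop body and the figure of 1000 are illustrative:

```java
public class GoodCitizenLoop {
    // Sum the lengths of many strings, yielding periodically so a long
    // CPU-bound run does not starve GUI or peer threads.
    public static long sumLengths(java.util.List<String> items) {
        long total = 0;
        int i = 0;
        for (String s : items) {
            total += s.length();
            if (++i % 1000 == 0) {
                Thread.yield(); // give other runnable threads a turn
            }
        }
        return total;
    }
}
```

Thread.yield() is only a hint to the scheduler, so this smooths behaviour rather than guaranteeing it; for delays on congestible resources, Thread.sleep() with a small argument is the blocking equivalent.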
It is interesting to read comments from vendors that a particular problem may make a system vulnerable to "XML fuzzing attacks". But I don't see that XML fuzzing (generating documents with random or systematic flaws and seeing whether any of them exposes a problem) is of itself an attack: it may merely expose a method of attack.
It strikes me that this would be a good subject for an abstract test suite: or at least a recipe that testers can use to generate some general test cases. Without claiming to be exhaustive:
- Does the application respond gracefully and safely to missing documents?
- Does the application respond gracefully and safely to zero size documents?
- Does the application respond gracefully and safely to documents with a well-formedness mistake at the beginning?
- Does the application respond gracefully and safely to documents with a well-formedness mistake at the end?
- Does the application respond gracefully and safely to oversized documents?
- Does the application respond gracefully and safely to multiple simultaneous large documents, which are together larger than max JVM memory / 2? Does this cause thrashing?
- Does the application respond gracefully and safely to infinite documents?
- Does the application reject gracefully and safely documents with DTDs? (not always appropriate)
- The good, the bad and the ugly: does the application respond gracefully and safely to documents in UTF-8, in CP-1252, and in some unexpected encoding (perhaps Big5, for Westerners)?
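Several of these cases can be generated mechanically. A toy sketch follows; the case names and payloads are invented, and a real suite would write the cases out to files and feed them to the application under test:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy generator for some of the checklist cases above.
public class XmlTestCases {
    public static Map<String, String> cases() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("zero-size", "");                 // empty document
        m.put("bad-start", "<<doc></doc>");     // well-formedness error at the start
        m.put("bad-end", "<doc></doc");         // well-formedness error at the end
        // The Opera-style case: many start-tags with no matching end-tags.
        StringBuilder deep = new StringBuilder("<xml>");
        for (int i = 0; i < 7400; i++) {
            deep.append("<A>");
        }
        m.put("unclosed-tags", deep.append("</xml>").toString());
        return m;
    }
}
```

The oversized, infinite, and multi-encoding cases need streams and byte-level control rather than in-memory strings, so they are better generated on the fly than stored.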
Readers with pointers to more practical info (no product references please!) are welcome to add comments.