XSLT-based XHTML Markup Sanitizer

By M. David Peterson
October 13, 2008

I've been meaning to write an XSLT-based XHTML markup sanitizer for a while now, and tonight I discovered I needed it sooner rather than later. In case you find it useful, here it is:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:lookup="http://xameleon.org/lookup" xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml" version="1.0" exclude-result-prefixes="html lookup">
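    <!-- Allow-list: each child names an accepted input element; @use is the element name emitted in its place -->
    <lookup:html>
        <html:p use="p"/>
        <html:em use="em"/>
        <html:strong use="strong"/>
        <html:b use="strong"/>
        <html:i use="em"/>
        <html:blockquote use="blockquote"/>
        <html:cite use="cite"/>
    </lookup:html>
    <!-- document('') loads this stylesheet itself, so the table above can be queried as data -->
    <xsl:variable name="safe-elements" select="document('')//lookup:html/*"/>
    <!-- Wrap the sanitized result in a single XHTML div -->
    <xsl:template match="/">
        <div>
            <xsl:apply-templates mode="validate"/>
        </div>
    </xsl:template>
    <!-- An input div is unwrapped: only its child elements and text are validated -->
    <xsl:template match="html:div" mode="validate">
        <xsl:apply-templates select="*|text()" mode="validate"/>
    </xsl:template>
    <!-- Any other element: look up its local name in the table; with no entry, nothing is output for it or its content -->
    <xsl:template match="*" mode="validate">
        <xsl:variable name="local-name" select="local-name()"/>
        <xsl:apply-templates select="$safe-elements[local-name() = $local-name]/@use" mode="safe">
            <xsl:with-param name="node" select="."/>
        </xsl:apply-templates>
    </xsl:template>
    <xsl:template match="text()" mode="validate">
        <!-- You could do some extended text matching here to remove any text seen as undesirable -->
        <xsl:value-of select="."/>
    </xsl:template>
    <!-- Invoked on a matching @use: emit the mapped element, without attributes, and validate the original element's children -->
    <xsl:template match="@*" mode="safe">
        <xsl:param name="node"/>
        <xsl:element name="{.}">
            <xsl:apply-templates select="$node/*|$node/text()" mode="validate"/>
        </xsl:element>
    </xsl:template>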
</xsl:stylesheet>

To avoid copy/pasting escaped markup, you can snag the same code from monoport.

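To adapt it to your specific needs, use the //lookup:html table to define which elements are allowed and which element name each one maps to in the output. For example, html:b becomes html:strong, html:i becomes html:em, and so forth. Any element without an entry in the table is dropped along with its content, except html:div, which is unwrapped so that its children are validated.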
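As a rough sketch of what extending the table might look like, the version below adds rows for code, ul, ol, and li. Those four rows are illustrative assumptions, not part of the original stylesheet. The namespace declarations are repeated on the element only so the fragment stands alone; inside the stylesheet they are inherited from xsl:stylesheet.

```xml
<!-- Possible drop-in replacement for the lookup:html block.
     The rows after the "added" marker are illustrative, not from the original. -->
<lookup:html xmlns:lookup="http://xameleon.org/lookup" xmlns:html="http://www.w3.org/1999/xhtml">
    <html:p use="p"/>
    <html:em use="em"/>
    <html:strong use="strong"/>
    <html:b use="strong"/>
    <html:i use="em"/>
    <html:blockquote use="blockquote"/>
    <html:cite use="cite"/>
    <!-- added: each of these maps to itself -->
    <html:code use="code"/>
    <html:ul use="ul"/>
    <html:ol use="ol"/>
    <html:li use="li"/>
</lookup:html>
```

With these rows in place, input such as <ul><li><b>x</b></li></ul> should come out as <ul><li><strong>x</strong></li></ul> in the XHTML namespace. Note that the lookup is per element and only compares local names, so an li outside any list would pass as well.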

The above code assumes all attributes are /evil/: none of them are copied to the output.
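If a particular attribute does need to survive, one possible approach is to let a table row name it. The sketch below is a variation on the stylesheet above, not the original: the keep attribute on the blockquote row is a hypothetical extension, and the mode="safe" template is changed to copy only the attribute that row names. Values are copied verbatim, so this only seems reasonable for attributes that cannot carry script; something like href would also need its URL scheme checked.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:lookup="http://xameleon.org/lookup" xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml" version="1.0" exclude-result-prefixes="html lookup">
    <lookup:html>
        <html:p use="p"/>
        <html:em use="em"/>
        <html:strong use="strong"/>
        <html:b use="strong"/>
        <html:i use="em"/>
        <!-- keep is a hypothetical extension: the one attribute let through for this element -->
        <html:blockquote use="blockquote" keep="cite"/>
        <html:cite use="cite"/>
    </lookup:html>
    <xsl:variable name="safe-elements" select="document('')//lookup:html/*"/>
    <xsl:template match="/">
        <div>
            <xsl:apply-templates mode="validate"/>
        </div>
    </xsl:template>
    <xsl:template match="html:div" mode="validate">
        <xsl:apply-templates select="*|text()" mode="validate"/>
    </xsl:template>
    <xsl:template match="*" mode="validate">
        <xsl:variable name="local-name" select="local-name()"/>
        <xsl:apply-templates select="$safe-elements[local-name() = $local-name]/@use" mode="safe">
            <xsl:with-param name="node" select="."/>
        </xsl:apply-templates>
    </xsl:template>
    <xsl:template match="text()" mode="validate">
        <xsl:value-of select="."/>
    </xsl:template>
    <xsl:template match="@use" mode="safe">
        <xsl:param name="node"/>
        <!-- The optional keep attribute on the same table row; an empty node-set when the row has none -->
        <xsl:variable name="keep" select="../@keep"/>
        <xsl:element name="{.}">
            <!-- Copy only a no-namespace attribute whose name equals $keep; attributes must be added before child content -->
            <xsl:copy-of select="$node/@*[namespace-uri() = '' and local-name() = $keep]"/>
            <xsl:apply-templates select="$node/*|$node/text()" mode="validate"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
```

Under these assumptions, a blockquote carrying both cite and onclick attributes would keep cite and lose onclick, while every element whose row has no keep attribute still comes out with no attributes at all.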

Enjoy!

