Validating Code Lists with Schematron

By Rick Jelliffe
November 13, 2008 | Comments: 5

How happy the man whose documents are clearly divided into variant and invariant: data versus schemas.

But in the real world, often there are data values or structures which have fixed choices, but not completely fixed: a twilight zone. For example, the values of a field with codes for different nations may vary independently of the schema which requires such codes be used: think of the political roil of Eastern Europe at the end of the cold war.

If the schema enumerates the allowed codes, then it will need to be updated to track the actual values, which requires an ongoing effort and creates a deployment and update burden; but if the schema only imposes some lesser requirement, such as requiring a token, developers need to develop some alternative mechanism for validating and documenting the constraints.

XSD and Code Lists

But there are more subtle and potentially catastrophic issues at play. If it is decided to update the schema by merely adding the new codes without removing old ones, that removes a check on incorrect data values. If, however, it is decided to put out a new version of the schema, then documents clearly need to signal which version of the schema they are supposed to conform to. And XML Schemas' type derivation mechanism may get in the way if it was not set up correctly: the correct method is a base type using tokens, with derived types that add the actual enumerated values as restrictions, and with type binding done using the general base schema rather than any particular restricted one.
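
A minimal sketch of that arrangement follows. The type names CountryCodeType and CountryCode2008Type are hypothetical, and the two enumerated values merely stand in for a full list.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <!-- General base type: only the weak constraint (a two-character token) -->
     <xs:simpleType name="CountryCodeType">
          <xs:restriction base="xs:token">
               <xs:length value="2" />
          </xs:restriction>
     </xs:simpleType>

     <!-- Derived type: restricts the base type to the codes known at some date -->
     <xs:simpleType name="CountryCode2008Type">
          <xs:restriction base="CountryCodeType">
               <xs:enumeration value="AU" />
               <xs:enumeration value="AZ" />
          </xs:restriction>
     </xs:simpleType>

     <!-- The element is bound to the general base type, not to a dated list -->
     <xs:element name="country" type="CountryCodeType" />

</xs:schema>
```

With this layout, a later list can be added as another restriction of the same base type without touching the element declaration.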

Furthermore, XSD is very frequently compiled rather than used for dynamic validation. So the option of merely having the code list in a separate namespace and schema module (administered separately and imported as needed) is not available.

Schematron and Hard-coded Code Lists

There are two basic ways of handling code lists with Schematron.

The first is to enumerate the values into a test.

 <rule context="country">
      <assert test=". = 'AU' or . = 'AZ' or ...">
      A country element should be a standard two-letter code.
      </assert>
 </rule>

In XPath as used by default in ISO Schematron, the dot . means the value of the current node, in this case the value of each country element in the document, one at a time. So to test an enumeration, simply write a series of string comparisons, joined by the logical connector or.

Of course, you could transform from an external code list into ISO Schematron (or XSD for that matter) quite easily in most cases. Very often code lists are maintained in some non-schema format, and will need this conversion.
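
As a rough sketch of such a conversion, the XSLT 1 stylesheet below generates a Schematron rule from a code list. It assumes a hypothetical input format, a countries element containing country elements that each carry a code attribute, and it assumes that no code contains an apostrophe; a real code list format would need its own match patterns.

```xml
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:sch="http://purl.oclc.org/dsdl/schematron">

     <xsl:output method="xml" indent="yes" />

     <!-- Assumed input: <countries><country code="AU">Australia</country>...</countries> -->
     <xsl:template match="/countries">
          <sch:schema>
               <sch:pattern id="country-codes">
                    <sch:rule context="country">
                         <sch:assert>
                              <!-- Build the test ". = 'AU' or . = 'AT' or ..." from the code attributes -->
                              <xsl:attribute name="test">
                                   <xsl:for-each select="country/@code">
                                        <xsl:if test="position() &gt; 1"> or </xsl:if>
                                        <xsl:text>. = '</xsl:text>
                                        <xsl:value-of select="." />
                                        <xsl:text>'</xsl:text>
                                   </xsl:for-each>
                              </xsl:attribute>
                              <xsl:text>A country element should be a standard two-letter code.</xsl:text>
                         </sch:assert>
                    </sch:rule>
               </sch:pattern>
          </sch:schema>
     </xsl:template>

</xsl:stylesheet>
```

Running this over the code list whenever the list changes keeps the generated schema in step with it, at the cost of a build step.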

Future-proofing: errors, warnings, cautions, notes

What about future-proofing this? Where you know the values allowed now, but expect them to change in some way in the future? You could handle this using the difference between assert elements and report elements. (In the example below, we are using a variable theCode to simplify access to the value.)

<rule context="country">
      <let name="theCode" value="normalize-space(.)" />
      <assert test="string-length( $theCode )=2" role="error">
      The country element should contain a two-letter country code.
      </assert>
      <report test="not($theCode= 'AU' or $theCode='AZ' or ...)"  role="warning">
      A country element has been used that is not a standard two-letter code as at 2008-12-01.
      </report>
</rule>

In this example, the assert element tests the weaker constraint that the value of the country element should be two characters long; a full version would include other constraints on tokens. This assert element has a role attribute that marks the assertion with the token "error". A report element reports when a value that is not on the code list is used; this report element has a role attribute that marks the report with the token "warning".

Now "warning" and "error" are not built into ISO Schematron: they are keywords decided by the creator of the schema. A Schematron implementation will pass these labels on to, for example, the SVRL output (SVRL = Schematron Validation Report Language). This provides all the information needed for the user or user agent to act correctly.
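
For illustration only, an SVRL fragment for a country element whose value is not on the list might look something like the following. The location value is invented, and the exact attributes emitted vary between implementations.

```xml
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
     <svrl:fired-rule context="country" />
     <!-- The role token from the schema is copied through to the report -->
     <svrl:successful-report test="not($theCode= 'AU' or $theCode='AZ' or ...)"
          role="warning" location="/*/country[3]">
          <svrl:text>A country element has been used that is not a standard two-letter code as at 2008-12-01.</svrl:text>
     </svrl:successful-report>
</svrl:schematron-output>
```

A user agent could then, for example, reject documents with any result marked "error" but only log those marked "warning".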

User selection: phases

Another approach is to use phases to select the code list. In the following example, we have one pattern that allows any code, then other patterns to exclude codes that are not appropriate. Let's say that in 2009, the country of Freedonia (FA) is founded and recognized. You can see this as akin to derivation by restriction, if that is your mental bent.

  <phase id="Countries-2008">
      <active pattern="any-code" />
      <active pattern="only-2008-codes" />
  </phase>
  <phase id="Countries-2009">
      <active pattern="any-code" />
      <active pattern="only-2009-codes" />
  </phase>
...
 <!-- Accepts every code that has ever been on the list -->
 <pattern id="any-code">
     <rule context="country">
           <assert test=". = 'AU' or . = 'AZ' or ... or . = 'FA'">
           A country element should be a standard two-letter code from the standard list.
           </assert>
     </rule>
 </pattern>
 <!-- Excludes the codes that were not yet valid in 2008 -->
 <pattern id="only-2008-codes">
     <rule context="country">
           <assert test="not(. = 'FA')">
           Freedonia was not a country before 2009 and should not be used.
           </assert>
     </rule>
 </pattern>

In this case, the user or agent is given the choice between which phase to use.
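
When nobody chooses, ISO Schematron also lets the schema nominate a default through the defaultPhase attribute on the schema element, which names a phase by its identifier. The sketch below is a cut-down, hypothetical schema with three codes standing in for the full list.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" defaultPhase="Countries-2009">

     <!-- Used whenever the user or agent does not select a phase explicitly -->
     <phase id="Countries-2009">
          <active pattern="any-code" />
     </phase>

     <pattern id="any-code">
          <rule context="country">
               <assert test=". = 'AU' or . = 'AZ' or . = 'FA'">
               A country element should be a standard two-letter code from the standard list.
               </assert>
          </rule>
     </pattern>

</schema>
```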

Dynamic exceptions

A better method is for the document itself to contain enough information to know which constraints to check: this is highly dynamic! Let's say that the top-level element in the instance document has an attribute with the year. This allows a schema such as the following:

<rule context="country">
     <assert test=". = 'AU' or . = 'AZ' or ... or . = 'FA'">
     A country element should be a standard two-letter code from the standard list.
     </assert>
     <!-- FA is only acceptable when the document is dated 2009 or later -->
     <assert test="/*/@year &gt;= 2009 or not(. = 'FA')">
     Freedonia was not a country before 2009 and should not be used.
     </assert>
</rule>
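
To make the behaviour concrete, here is a hypothetical instance; survey-report is an invented element name, and only the year attribute on the top-level element matters to the rule.

```xml
<!-- Hypothetical instance document dated 2008 -->
<survey-report year="2008">
     <country>AU</country>   <!-- passes both assertions -->
     <country>FA</country>   <!-- fails the second assertion: the year is earlier than 2009 -->
</survey-report>
```

Changing the attribute to year="2009" would let the FA entry pass without any change to the schema.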

 

Datatype and exceptions in different patterns

The second assertion is perhaps a little convoluted. Some people or situations may favour a solution that makes each exception explicit, such as the following:

<pattern id="general-datatype">
<rule context="country">
           <assert test=". = 'AU' or . = 'AZ' or ... or . = 'FA'">
           A country element should be a standard two-letter code from the standard list.
           </assert>
</rule>
</pattern>
<pattern id="yearly-exceptions">
<!-- This rule only fires for documents dated before 2009 -->
<rule context="country[/*/@year &lt; 2009]">
           <assert test="not(. = 'FA')">
           Freedonia was not a country before 2009 and should not be used.
           </assert>
</rule>
</pattern>

Explicit enumeration

Some may even prefer explicit enumeration:

<pattern id="country-enumerations">
<rule context="country[/*/@year = '2008']">
           <assert test=". = 'AU' or . = 'AZ' or ...">
           A country element should be a standard two-letter code from the standard list.
           </assert>
</rule>
<rule context="country[/*/@year = '2009']">
           <assert test=". = 'AU' or . = 'AZ' or ... or . = 'FA'">
           A country element should be a standard two-letter code from the standard list.
           </assert>
</rule>
</pattern>

 

Schematron and External Lists

One of the most powerful functions that can be allowed in XPaths is the document() function. A typical way to use it is to read the external document into a variable and then dig inside that variable for particular constraints. For example, say we have an external XML document called codes.xml with a code list such as the following:

<countries>
     <country code="AU">Australia</country>
     <country code="AT">Austria</country>
           ...
</countries>

In this case our assertion becomes something like the following:

<pattern id="country-tests">
     <let name="codes" value="document('codes.xml')" />
     <rule context="country">
           <!-- The codes are held in code attributes in codes.xml, hence @code -->
           <assert test="$codes//@code[normalize-space(.) = normalize-space(current()/.)]" >
           The country element should contain a country code.
           </assert>
     </rule>
</pattern>

One gotcha with document() is that not all implementations provide the expected base for relative URLs; using absolute URLs is reliable, however. Accessing files in the local file system means using the file: URI scheme, which can involve some fiddling under Windows because of the device component of absolute file paths, such as the c: prefix. If the external list is in the same ZIP package, then you will need an implementation which supports the zip:, jar: or pack: URL schemes.
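
As a hedged sketch, a Windows file such as c:\codelists\codes.xml (a hypothetical location) would usually be written as a file: URI with three slashes and forward slashes throughout. The extra assertion is there only to make a failed load visible; whether a missing file produces an empty result or a processor error depends on the implementation.

```xml
<pattern id="country-tests-absolute">
     <!-- Hypothetical Windows path c:\codelists\codes.xml written as an absolute file: URI -->
     <let name="codes" value="document('file:///c:/codelists/codes.xml')" />

     <rule context="/*">
           <assert test="$codes/countries/country" role="debug">
           The code list file should be loadable and should contain country entries.
           </assert>
     </rule>
</pattern>
```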

Using a variable

Note the use of the current() function to allow comparison in the middle of the XPath with the original context. It is an idiom that I use often, however, some people favour using an explicit variable, such as the following:

<pattern id="country-tests">
     <let name="codes" value="document('codes.xml')" />

     <rule context="country">
            <!-- Capture the context value once, so no current() call is needed -->
            <let name="this" value="normalize-space(.)" />

            <assert test="$codes//@code[normalize-space(.) = $this]" >
            The country element should contain a country code.
            </assert>
     </rule>
</pattern>

Use this kind of pattern when the code list is not very static and is maintained centrally or by other people or in a different format than the schema language.

A complex example

Here is a real example, from the world of geo-sciences. (This example actually has taken months to get right, as it turned out to have a few intricacies!) Indeed, the original requestor decided to move directly to using XSLT, which is of course always an option. There are a few kinds of constraints which Schematron using XSLT1 is simply not powerful enough to express. XSLT2 is much more powerful, but even with XSLT2, capturing and understanding the constraint is sometimes not trivial, and is a legitimate stumbling block.

The code list

The code list uses a format called CT_CodelistCatalogue in a file anzlic-theme.xml. Here is a typical fragment, with elisions:

<CT_CodelistCatalogue xmlns="http://www.isotc211.org/2005/gmx" 
     xmlns:gmx="http://www.isotc211.org/2005/gmx"
     xmlns:gco="http://www.isotc211.org/2005/gco" 
     xmlns:gml="http://www.opengis.net/gml"  >
     <!--=====Catalogue description=====-->
     <name>
           <gco:CharacterString>ANZLIC search words</gco:CharacterString>
     </name>
      ...
     <!--============================= Codelists =======================================-->
     <codelistItem>
           <CodeListDictionary gml:id="anzlic-theme">
                 <gml:description>Codelists for thematic classification of resources, as defined by Australia New Zealand Land Information Council</gml:description>
                 <gml:identifier codeSpace="http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml">anzlic-theme</gml:identifier>
                 <codeEntry>
                       <CodeDefinition gml:id="AGRICULTURE"> 
                             <gml:description>AGRICULTURE non-specific</gml:description>
                             <gml:identifier codeSpace="http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml">AGRICULTURE</gml:identifier>
                       </CodeDefinition>
                 </codeEntry>
                 <codeEntry>
                       <CodeDefinition gml:id="AGRICULTURE-Crops">
                             <gml:description>AGRICULTURE Crops</gml:description>
                             <gml:identifier codeSpace="http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml">AGRICULTURE-Crops</gml:identifier>
                       </CodeDefinition>
                 </codeEntry>
                 <codeEntry>
                       <CodeDefinition gml:id="AGRICULTURE-Horticulture">
                             <gml:description>AGRICULTURE Horticulture</gml:description>
                             <gml:identifier codeSpace="http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml">AGRICULTURE-Horticulture</gml:identifier>
                       </CodeDefinition>
                 </codeEntry>
                 <codeEntry>
                       <CodeDefinition gml:id="AGRICULTURE-Irrigation">
                             <gml:description>AGRICULTURE Irrigation</gml:description>
                             <gml:identifier codeSpace="http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml">AGRICULTURE-Irrigation</gml:identifier>
                       </CodeDefinition>
                 </codeEntry>

              ...
</CodeListDictionary> </codelistItem> </CT_CodelistCatalogue>

The codes of interest are //gmx:CodeListDictionary/@gml:id (let's call it the "unique id of a dictionary in the codelist") and //gmx:codeEntry/gmx:CodeDefinition/gml:identifier (let's call it the "identifier text of an entry in the codelist").

There may be multiple codelist dictionaries, and multiple codelists, but they are not cross-linked in any way that is material to us here.

The metadata instance

Then we have another file ANZ0001.xml containing various kinds of metadata (for example, metadata concerning a geographical survey.)

 <gmd:MD_Metadata>
  ...
  <gmd:descriptiveKeywords>
           <gmd:MD_Keywords>
              <gmd:keyword>
                 <gco:CharacterString>MARINE</gco:CharacterString>
              </gmd:keyword>
              <gmd:keyword>
                 <gco:CharacterString>MARINE Coasts</gco:CharacterString>
              </gmd:keyword>
              <gmd:keyword>
                 <gco:CharacterString>OCEANOGRAPHY Physical</gco:CharacterString>
              </gmd:keyword>
              <gmd:type>
                 <gmd:MD_KeywordTypeCode
                    codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#MD_KeywordTypeCode"
                    codeListValue="theme">theme</gmd:MD_KeywordTypeCode>
              </gmd:type>
              <gmd:thesaurusName>
                 <gmd:CI_Citation>
                    <gmd:title>
                       <gco:CharacterString>ANZLIC Search
                          Words</gco:CharacterString>
                    </gmd:title>
                    <gmd:date>
                       <gmd:CI_Date>
                          <gmd:date>
                             <gco:Date>2008-05-16</gco:Date>
                          </gmd:date>
                          <gmd:dateType>
                             <gmd:CI_DateTypeCode
                                codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode"
                                codeListValue="revision"
                                >revision</gmd:CI_DateTypeCode>
                          </gmd:dateType>
                       </gmd:CI_Date>
                    </gmd:date>
                    <gmd:edition>
                       <gco:CharacterString>Version 2.1</gco:CharacterString>
                    </gmd:edition>
                    <gmd:editionDate>
                       <gco:Date>2008-05-16</gco:Date>
                    </gmd:editionDate>
                    <gmd:identifier>
                       <gmd:MD_Identifier>
                          <gmd:code>
                             <gco:CharacterString>
 
http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml#anzlic-theme
   </gco:CharacterString> 
                          </gmd:code>
                       </gmd:MD_Identifier>
                    </gmd:identifier>
                    <gmd:citedResponsibleParty>
                       <gmd:CI_ResponsibleParty>
                          <gmd:organisationName>
                             <gco:CharacterString>ANZLIC the Spatial
                                Information Council</gco:CharacterString>
                          </gmd:organisationName>
                          <gmd:role>
                             <gmd:CI_RoleCode
                                codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_RoleCode"
                                codeListValue="custodian">custodian</gmd:CI_RoleCode>
                          </gmd:role>
                       </gmd:CI_ResponsibleParty>                        
                    </gmd:citedResponsibleParty>
                 </gmd:CI_Citation>
              </gmd:thesaurusName>
           </gmd:MD_Keywords>
        </gmd:descriptiveKeywords>

 ...

  </gmd:MD_Metadata>

The items of interest are //gmd:MD_Keywords/gmd:keyword/gco:CharacterString (let's call these the "keywords used in the metadata") and //gmd:MD_Keywords/gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString, which holds the URI for the keyword file.

This latter string is actually a URI reference with two parts: the part before the # is the URL of a keyword file, and the part after the # is a fragment identifier; let's call it a "dictionary reference in the metadata".

There may be multiple metadata files, which are not cross-linked in any way that is material to us here.

The constraint

Now that we have clearly identified which information is being used, the next step is to figure out which document is being validated. This is sometimes not an obvious step to make: here, for example, do we want to validate the code list file or the metadata file? If we wanted to test, for example, that every entry in the codelist is actually used somewhere, we would be validating the code list. But if we want to check that each keyword in the metadata has a corresponding entry, we are validating the metadata instance.

And, indeed, we do want to validate that.

So let's make up some suitable assertion text. Each keyword in the metadata should have a corresponding definition in the codelist specified by the thesaurus identifier.

 <pattern>
      <rule context="gmd:MD_Keywords/gmd:keyword/gco:CharacterString">
           <let name="URI-reference" value=
            "../../gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString" />    
           <let name="URI" value= "substring-before( $URI-reference, '#')" />
             <let name="fragment" value= "substring-after( $URI-reference, '#')" />
             <let name="code-list-document" value="document( $URI )" />
             <let name="dictionary" value=" $code-list-document//gmx:CodeListDictionary[@gml:id = $fragment ]" />
             <assert test=" $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:identifier = current()]" > 
             Each keyword in the metadata should have a corresponding entry in the codelist     
              specified by the thesaurus identifier
               </assert>
     </rule>
</pattern>
 

Interestingly, when writing and debugging the code in this article, by far the most tricky bugs to find were due to the lack of a systematic case rule for the element names.

More like it...

Now in testing, it turns out that our constraint is a little more complicated. It seems that sometimes people mark up the metadata using the identifier text of the codelist, but other times they use the actual text description in the code list. Frequently these are the same, or one is just a hyphenated (tokenized) version of the other. So let's expand our assertion a little. Again, we make good use of variables, but we also test the variables: best practice is to have an assertion for each variable, as a matter of basic unit-testing and to allow debugging. In particular, it is prudent to report whether the codelist file and dictionary could in fact be located.

This example gives a complete schema, not just a fragment.

<schema  xmlns="http://purl.oclc.org/dsdl/schematron"  >

     <title>Example of a Schematron schema for checking that codes are drawn from codelists specified in the document being validated</title>

     <ns prefix="gmx" uri="http://www.isotc211.org/2005/gmx"  /> 
     <ns prefix="gco" uri="http://www.isotc211.org/2005/gco"  />
     <ns prefix="gmd" uri="http://www.isotc211.org/2005/gmd"  />
     <ns prefix="gml" uri="http://www.opengis.net/gml"  />

<pattern>

        <rule
             context= "gmd:MD_Keywords/gmd:keyword/gco:CharacterString"  >

            <let name= "URI-reference" value=
"../../gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString" />    

             <let name="URI" value= "substring-before( $URI-reference, '#')" />

            
              <let name="fragment" value= "substring-after( $URI-reference, '#')" />

              
              <let name="code-list-document" value="document( $URI )"/>

          
              <let name="dictionary" value=" $code-list-document//gmx:CodeListDictionary[@gml:id = $fragment ]" />

 

            <assert test=" string-length(normalize-space( $URI )) &gt; 0" role="debug" >
             The codelist file  URI should not be empty
             </assert>

             <assert  test=" string-length(normalize-space( $fragment )) &gt; 0" role="debug">
             The fragment identifier in a codelist URI should not be empty
             </assert>

             <assert test="$dictionary" role="debug">
             The codelist URI should identify a loadable dictionary.
             </assert>

             <assert test=" $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:identifier = current()]
              or
            $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:description= current()]" > 
            Each keyword in the metadata should have a corresponding entry in the codelist
            specified by the thesaurus identifier.
             </assert>
       </rule>
</pattern>

</schema>

Note that what we have not done is to use the MD_Keywords element as our context, and then attempt to iterate through each keyword in assertions. It may be possible to do this using XSLT2, however it is certainly not possible in XSLT1 to iterate through multiple lists.
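
For what it is worth, a tentative XSLT2 version might use an XPath 2 quantified expression, as sketched below. It is untested against the real data. It assumes an implementation that supports queryBinding="xslt2" and exactly one thesaurus identifier per MD_Keywords element, and it uses normalize-space() because the identifier in the instance is surrounded by whitespace.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

     <ns prefix="gmx" uri="http://www.isotc211.org/2005/gmx" />
     <ns prefix="gco" uri="http://www.isotc211.org/2005/gco" />
     <ns prefix="gmd" uri="http://www.isotc211.org/2005/gmd" />
     <ns prefix="gml" uri="http://www.opengis.net/gml" />

     <pattern>
          <rule context="gmd:MD_Keywords">
               <!-- Assumes exactly one thesaurus identifier per MD_Keywords element -->
               <let name="URI-reference" value="normalize-space(gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString)" />
               <let name="dictionary" value="doc(substring-before($URI-reference, '#'))//gmx:CodeListDictionary[@gml:id = substring-after($URI-reference, '#')]" />

               <!-- XPath 2 quantified expression: every keyword must match some entry -->
               <assert test="every $keyword in gmd:keyword/gco:CharacterString satisfies
                    $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:identifier = $keyword or gml:description = $keyword]">
               Each keyword in the metadata should have a corresponding entry in the codelist
               specified by the thesaurus identifier.
               </assert>
          </rule>
     </pattern>

</schema>
```

The drawback of this design is that a single failure is reported for the whole MD_Keywords element, without saying which keyword is at fault, whereas a rule per keyword gives a precise location.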

Reducing file loads?

All well and good, but this will run fairly slowly because each keyword in the metadata file causes the codelist to be reloaded. Is there any way of speeding this up, potentially? For a start, we could use XSLT2, which has a doc() function with sticky semantics (repeated calls with the same URI return the same document).

If we have a fixed list of code files, then we can have a separate pattern to test against each codelist. (We might also put in another pattern

 <pattern id="anzlic-theme">
      <let name="URI" value= "'http://asdd.ga.gov.au/asdd/profileinfo/anzlic-theme.xml'" />
 
      <rule context="gmd:MD_Keywords[  starts-with( 
             gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString, $URI )
]/gmd:keyword/gco:CharacterString">
     <let name="URI-reference" value=
      "../../gmd:thesaurusName/gmd:CI_Citation/gmd:identifier/gmd:MD_Identifier/gmd:code/gco:CharacterString" />    
     <let name="fragment" value= "substring-after( $URI-reference, '#')" />
     <let name="code-list-document" value="document( $URI )" />
     <let name="dictionary" value=" $code-list-document//gmx:CodeListDictionary[@gml:id = $fragment ]" />
             
           <assert test=" string-length(normalize-space( $URI )) &gt; 0" >
           The codelist file  URI should not be empty.
            </assert>
 
           <assert test=" string-length(normalize-space( $fragment )) &gt; 0" >
           The fragment identifier in a codelist URI should not be empty.
            </assert>
 
           <assert test="$dictionary" >
           The codelist URI should identify a loadable dictionary.
           </assert>
 
            <assert test=" $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:identifier = current()] or
  $dictionary/gmx:codeEntry/gmx:CodeDefinition[gml:description = current()]" > 
 Each keyword in the metadata should have a corresponding entry in the codelist specified by the thesaurus identifier.
            </assert>
     </rule>
</pattern>
(Thanks to John Hockaday for raising this example.)


5 Comments

Excellent posting Rick!

Schematron is an excellent way to capture business rules. The way Schematron expresses its rules with simple XPath expressions is very elegant. An XForms interface to Schematron would be very useful to make it accessible to more people.

You should write a book on Schematron!

- Dan

Dan: Thanks.

Several books have chapters on Schematron. Eric van der Vlist has a book from O'Reilly's Shortcut series, available as PDF for $9.99 that will be especially good for people from an XSLT background.
http://oreilly.com/catalog/9780596527716/index.html
(51 pages.)

Also, the ISO standard itself is available for free download (30 pages). It is not a tutorial, but I think it is not bad. http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html

Rick,

I needed to change the code in "External lists" to make it work.

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<pattern>
<let name="codes" value="document('codes.xml')" />
<rule context="country">
<assert test="$codes//@code[normalize-space(.) = normalize-space(current()/.)]" >
The country element should contain a country code.
</assert>
</rule>
</pattern>
</schema>

What am I missing in the original example?

Paul H: Good catch. I have fixed this in two occurrences.

Don't forget the work of OASIS standardization in the area of code lists undertaken by the OASIS Code List Representation Technical Committee:

http://www.oasis-open.org/committees/codelist

In that committee we have standardized an XML representation of code lists:

- genericode 1.0 - lists of codes with list-level and code-level meta data

http://docs.oasis-open.org/codelist/genericode

Underway is the development of expressing, in XPath, the context of the use of code lists:

- context/value association using genericode 0.5 draft 1

http://www.oasis-open.org/committees/document.php?document_id=29990

And to the point of your post, Rick, there is an off-the-shelf Schematron-based implementation of validation using CVA files:

http://www.cranesoftwrights.com/resources/ubl/index.htm#cva2sch

I think Schematron is ideal for the checking of code lists, and those who need and work with code lists should consider standardized and portable approaches to expressing those requirements in XML documents.
