CSI Sydney: Character Set Investigation

A tale of drugs and confusion

By Rick Jelliffe
May 12, 2009 | Comments: 5

The scene: a document of pharmaceutical data keeps on displaying a capital A circumflex (Â) after each major drug name but before a generated trademark sign.

The problem: the character meant something in some original data, but what? You can never afford to ignore strange characters in mission critical data: they may have significance themselves or expose an underlying transcoding problem you were not aware of. The encoding information about the original 8-bit data is lost.

The approach: First we look at the text and figure out what it could be: a non-breaking space or zero-width non-joiner (ZWNJ) is typographically likely, and becomes the working theory.

Â is the Unicode character U+00C2. However, characters in the Latin 1 block are prone to shadowing by error characters: bytes from other 8-bit character sets being read in by a system using the default encodings of most PCs (CP-1252, ISO 8859-1). So we looked at what character 0xC2 is in common character sets, using the handy tables at the Unicode Consortium.

First we look at MacRoman. Data from around 1985 to 1995 could have used that encoding. But 0xC2 is the not sign (¬). No good.

If it isn't a Mac issue, the next most likely issue is that it is Adobe-related, since Adobe's encodings are also very popular in publishing. Dingbats 0xC2 is a circled digit 3. Stdenc 0xC2 is an acute accent. Symbol font 0xC2 is a bold Fraktur R. Aha...probably not a spacing issue at all.
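The same lookup can be sketched in a few lines of Python. This is only an illustration, not the tool we used: the three standard codecs are decoded by Python itself, but Python ships no codecs for the Adobe font encodings, so those entries (the `ADOBE_0XC2` table, a name invented here) are simply copied by hand from the findings above.

```python
import unicodedata

SUSPECT = 0xC2

# Encodings that ship with Python's standard codecs.
CODECS = ["latin_1", "cp1252", "mac_roman"]

# Assumption: no Adobe codecs exist in Python, so these values are
# hand-copied from the text above rather than looked up by the code.
ADOBE_0XC2 = {
    "Adobe Standard": "\u00b4",  # acute accent
    "Zapf Dingbats": "\u2462",   # circled digit three
    "Adobe Symbol": "\u211c",    # Fraktur capital R
}


def candidates(byte_value):
    """Decode one byte under each standard codec in CODECS."""
    return {codec: bytes([byte_value]).decode(codec) for codec in CODECS}


table = candidates(SUSPECT)
table.update(ADOBE_0XC2)

for name, ch in table.items():
    print(f"{name:15} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Printing the Unicode name next to each candidate makes it easy to spot which reading is typographically plausible for the data at hand.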

The second theory: A bold Fraktur R was being used sometimes as the registered trademark symbol. The programmer confirmed that the current transformations didn't cope with this, but that they would be transferred. So it looks like the circled R ® (U+00AE) should be used.


Rick hi

Truly, it's moments like this which make life worth living :-)

- Alex (not entirely joking either).

Sounds like it could be a UTF-8 issue instead. U+00AE encoded as UTF-8 is the two bytes 0xC2 0xAE. If you then decode it as two ISO-8859-1 characters, you end up with "Â®".
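This is easy to reproduce. A minimal sketch in Python, where the variable names are just for illustration:

```python
registered = "\u00ae"                # ® REGISTERED SIGN

raw = registered.encode("utf-8")     # what a UTF-8 producer writes
misread = raw.decode("latin_1")      # what an ISO-8859-1 consumer sees

print([hex(b) for b in raw])         # ['0xc2', '0xae']
print(misread)                       # Â®
print(len(misread))                  # 2
```

One character goes in and two come out, with the stray Â sitting immediately before the sign.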

Philip: Yes! The plot thickens...we need to go back to the original data to confirm which is correct.

Nowadays, looking at UTF-8 problems should probably be first on the list of things.
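One cheap way to put that check first is to try reversing the suspected misreading: re-encode the text with the wrong codec and see whether the bytes decode cleanly as UTF-8. The sketch below is a heuristic only, and `undo_utf8_misread` is a name made up for this example. It assumes the wrong codec was ISO-8859-1 unless told otherwise, and a clean round trip is a strong hint rather than proof, so the original data still needs checking.

```python
def undo_utf8_misread(text, wrong_codec="latin_1"):
    """Return the repaired string, or None if text does not look like
    UTF-8 that was decoded with wrong_codec."""
    try:
        repaired = text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The bytes are not valid UTF-8 (or not representable in
        # wrong_codec), so this theory does not fit the text.
        return None
    # Pure ASCII round-trips unchanged, which tells us nothing.
    return repaired if repaired != text else None


print(undo_utf8_misread("Drug\u00c2\u00ae"))  # Drug®
print(undo_utf8_misread("Drug\u00ae"))        # None
```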

In fact, all the Unicode characters U+00A0 to U+00BF have the property that this transcoding error would make the correct glyph appear after a Â. For U+00C0 to U+00FF the initial character is an Ã followed by various characters. That is a neat property I was not aware of. It may have misdirected me!
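The pattern falls out of how UTF-8 encodes code points below U+0100, and it can be checked exhaustively. A small sketch, assuming the misreading codec is ISO-8859-1 (the helper name `misread` is invented here):

```python
def misread(cp):
    """Encode one code point as UTF-8, then wrongly decode as ISO-8859-1."""
    return chr(cp).encode("utf-8").decode("latin_1")


# U+00A0..U+00BF: lead byte 0xC2 (Â), then the character itself.
same_glyph_after_a_circumflex = all(
    misread(cp) == "\u00c2" + chr(cp) for cp in range(0xA0, 0xC0)
)

# U+00C0..U+00FF: lead byte 0xC3 (Ã), then a byte shifted down by 0x40.
a_tilde_then_shifted = all(
    misread(cp) == "\u00c3" + chr(cp - 0x40) for cp in range(0xC0, 0x100)
)

print(same_glyph_after_a_circumflex)  # True
print(a_tilde_then_shifted)           # True
```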

How appropriate that there should be a character set error for this article. I'm getting a capital A with a tilde or an acute accent instead of the trademark sign.

David: The only (R) is on the last line. The web page seems to be sent with the correct encoding labelling in the tag, so it should be showing up in your browser as UTF-8. Some browsers get set to force CP1252 encoding, in which case you will certainly get two characters for the ®.
