Syntax coloring utility

By Kyle Dent
April 19, 2010

I often write HTML pages or documentation that include code samples. Code presented this way is much easier to follow with syntax highlighting. I had found a script that could highlight Perl code, and then I realized I needed the same thing for C. I poked around online for a bit and found a script to highlight C code, but I really didn't want to depend on two scripts, and I knew I would need the same thing for other languages later. I had an itch that needed scratching, and in case anyone else has a similar need, I've posted a new script on my web site that inserts HTML markup into source code files to provide colored syntax highlighting.

The script is easily extended to cover almost any programming language; all of the languages are defined at the end of the script. It separates language keywords into statements, which are used for program control, and identifier words, which are used for declaring and defining identifiers. To add a new language, follow the format of the existing definitions, including tokens for special values, comments, and quote characters.

You can download the script and even try it out at

I wrote this script in Perl but thought that I might like to change to something else later, so it's not very 'Perl-y' in style. It is, however, a classic finite-state parser with a simple lexer to find individual tokens in source code. It's also short and straightforward in its approach, so if you want an example of this kind of thing, it could be worth a look. The first part of the script reads the command line and the language definitions. After that, the action begins: a single while loop repeatedly grabs the next token from the lexer, then invokes the state-transition function with the current state and the current token.

The good stuff mostly happens within the state-transition function next_state(). When it finds a keyword, a comment, or something else it's interested in, it injects the appropriate HTML markup for color highlighting. Everything else is passed through as is, except that ampersands, greater-than signs, and less-than signs are encoded as HTML entities for display. The lexer reads one character at a time and builds up a token from the input stream, with some special handling for C-style comments (which are used by several languages).
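
To make the lexer description concrete, here is a minimal sketch of a character-at-a-time lexer. It is not the code from the script: make_lexer, the one-character lookahead, and the token rules (runs of word characters, the two-character comment delimiters, and everything else as single characters) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the script's actual lexer.
# Builds a lexer over a filehandle. One character of lookahead is kept in
# $peek so that a character that ends a token is returned on the next call.
sub make_lexer {
    my ($fh) = @_;
    my $peek;    # pushed-back character, if any

    my $next_char = sub {
        if ( defined $peek ) {
            my $c = $peek;
            undef $peek;
            return $c;
        }
        return getc($fh);    # undef at end of input
    };

    return sub {
        my $c = $next_char->();
        return undef unless defined $c;

        # Words: letters, digits, and underscores.
        if ( $c =~ /\w/ ) {
            my $token = $c;
            while ( defined( my $n = $next_char->() ) ) {
                if ( $n =~ /\w/ ) { $token .= $n; next; }
                $peek = $n;    # not part of the word; save it for later
                last;
            }
            return $token;
        }

        # Special handling for C-style comments: "/*" and "*/" are
        # returned as single two-character tokens.
        if ( $c eq '/' or $c eq '*' ) {
            my $n    = $next_char->();
            my $want = $c eq '/' ? '*' : '/';
            return $c . $n if defined $n and $n eq $want;
            $peek = $n;    # stays undef at end of input
            return $c;
        }

        return $c;    # whitespace, punctuation, and so on
    };
}

# Usage: tokenize a string through an in-memory filehandle.
my $source = "int x = 1; /* set x */\n";
open( my $fh, '<', \$source ) or die "cannot open in-memory handle: $!";
my $lexer = make_lexer($fh);
my @tokens;
while ( defined( my $t = $lexer->() ) ) {
    push @tokens, $t;
}
print join( '|', @tokens ), "\n";
```

Because whitespace comes back as tokens too, concatenating the tokens reproduces the input exactly, which is what a highlighter needs in order to pass unrecognized text through unchanged.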

Here's an example of what it does on a snippet of itself:

    # FSM Sigma:

    my $cur_token;
    my $cur_state = $START;
    while ( defined ($cur_token = lexer($fh)) ) {
        if ( defined $$statements{$cur_token} ) {
            $cur_state = next_state($cur_state, "STATEMENT", $cur_token);
        } elsif ( defined $$identifiers{$cur_token} ) {
            $cur_state = next_state($cur_state, "IDENTIFIER", $cur_token);
        } elsif ( defined $$opencomment{$cur_token} ) {
            $cur_state = next_state($cur_state, "COMMENT", $cur_token);
        } elsif ( defined $$quote{$cur_token} ) {
            $cur_state = next_state($cur_state, "QUOTE", $cur_token);
        } elsif ( defined $$values{$cur_token} ) {
            $cur_state = next_state($cur_state, "VALUE", $cur_token);
        } else {
            $cur_state = next_state($cur_state, "TOKEN", $cur_token);
        }
    }
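
For readers who want to see the other half of that loop, the following is a rough sketch of what a state-transition function with this calling convention might look like. Only the name next_state, its three arguments, and $START come from the snippet above; the $IN_COMMENT and $IN_QUOTE states, the color table, the helper functions, and the decision to collect output in $html are all assumptions. It also handles only comments closed by "*/" and ignores escaped quotes inside strings.

```perl
use strict;
use warnings;

# State names other than $START are assumptions made for this sketch.
my ( $START, $IN_COMMENT, $IN_QUOTE ) = ( 0, 1, 2 );

# Hypothetical color scheme, keyed by token type.
my %color = (
    STATEMENT  => 'blue',
    IDENTIFIER => 'green',
    VALUE      => 'purple',
    COMMENT    => 'gray',
    QUOTE      => 'red',
);

my $html       = '';    # output is collected here so the demo can inspect it
my $open_quote = '';    # quote character that started the current string

# Encode the three characters that HTML would otherwise interpret.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities are not re-encoded
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

sub span_open {
    my ($type) = @_;
    return sprintf '<span style="color:%s">', $color{$type};
}

sub next_state {
    my ( $state, $type, $token ) = @_;
    my $text = html_escape($token);

    # Inside a comment or string, everything is passed through until the
    # closing delimiter arrives, whatever type the caller assigned to it.
    if ( $state == $IN_COMMENT ) {
        $html .= $text;
        return $IN_COMMENT unless $token eq '*/';
        $html .= '</span>';
        return $START;
    }
    if ( $state == $IN_QUOTE ) {
        $html .= $text;
        return $IN_QUOTE unless $token eq $open_quote;
        $html .= '</span>';
        return $START;
    }

    # Normal code: open a span for comments and strings, wrap single
    # tokens that have a color, and pass everything else through.
    if ( $type eq 'COMMENT' ) {
        $html .= span_open($type) . $text;
        return $IN_COMMENT;
    }
    if ( $type eq 'QUOTE' ) {
        $open_quote = $token;
        $html .= span_open($type) . $text;
        return $IN_QUOTE;
    }
    if ( exists $color{$type} ) {
        $html .= span_open($type) . $text . '</span>';
        return $START;
    }
    $html .= $text;
    return $START;
}

# Usage: feed a hand-classified token stream through the state machine.
my $state = $START;
my @input = (
    [ STATEMENT => 'if' ], [ TOKEN => ' ' ],  [ TOKEN => 'a' ],
    [ TOKEN     => '<' ],  [ TOKEN => 'b' ],  [ TOKEN => ' ' ],
    [ COMMENT   => '/*' ], [ STATEMENT => 'if' ], [ TOKEN => '*/' ],
);
for my $pair (@input) {
    my ( $type, $token ) = @$pair;
    $state = next_state( $state, $type, $token );
}
print "$html\n";
```

Keeping the comment and quote checks ahead of the type checks is what stops a keyword inside a comment from being colored a second time: the second "if" in the demo is classified as a statement by the caller, but the state machine is in the comment state and passes it through.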

Let me know if you add any languages and I'll incorporate them into the script.

UPDATE: 4/21/10

From a few comments mentioning other tools, I realize I should have mentioned my requirements for this utility. Primarily it has to work within a pipeline of document creation tools. At a minimum it should be able to read the source code file from the standard input and write to the standard output.

I also wanted something that could be easily distributed (one file) and installed on different platforms, including Windows. My script does require Perl, but as I mentioned, I wrote it so that it could easily be ported to another language. A C port, for example, could be compiled and distributed for Windows users.

Besides working in a pipeline and easy distribution, I wanted something that is very easy to add new languages to. The original version of my script defined languages using Perl data structures. I changed that bit to use a DATA section within the script that uses a much simpler properties-type definition. I purposely traded off completeness for simplicity.
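
As an illustration of the properties-type idea, here is a small sketch of reading such definitions into lookup hashes. The file format shown (a bracketed language name followed by "category = words" lines) and the function load_languages are invented for this example and may not match the script's real DATA section; only the category names statements, identifiers, opencomment, quote, and values are taken from the snippet earlier in the post. The demo reads from an in-memory string, whereas a script with a real DATA section would pass \*DATA instead.

```perl
use strict;
use warnings;

# Hypothetical format: "[language]" headers followed by
# "category = word word ..." lines.
# Returns a hash ref: language => category => { token => 1, ... }.
sub load_languages {
    my ($fh) = @_;
    my ( %languages, $lang );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        if ( $line =~ /^\[(\w+)\]\s*$/ ) {    # start of a language section
            $lang = $1;
            next;
        }
        if ( defined $lang and $line =~ /^(\w+)\s*=\s*(.*?)\s*$/ ) {
            my ( $category, $words ) = ( $1, $2 );
            # Store each word as a hash key for fast membership tests.
            $languages{$lang}{$category} =
                { map { ( $_ => 1 ) } split ' ', $words };
            next;
        }
        warn "ignoring line $.: $line\n";
    }
    return \%languages;
}

# Stand-in for the DATA section, using the assumed format.
my $definitions = <<'END';
[c]
statements  = if else while for return break continue
identifiers = int char float double struct void
values      = NULL
opencomment = /* //
quote       = " '
END

open( my $fh, '<', \$definitions ) or die "cannot open in-memory handle: $!";
my $languages = load_languages($fh);

# The resulting hash refs can be used the same way as in the main loop.
my $statements  = $languages->{c}{statements};
my $identifiers = $languages->{c}{identifiers};
for my $token (qw(while int foo)) {
    my $type = defined $$statements{$token}  ? 'STATEMENT'
             : defined $$identifiers{$token} ? 'IDENTIFIER'
             :                                 'TOKEN';
    print "$token: $type\n";
}
```

With a layout like this, adding a language means adding one section of plain text; no Perl data structures need to be edited.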

If I had found the code2html tool (thanks, Glen), I might have gone with that; however, adding new languages is a bit complicated. On the other hand, it already has a load of languages and is quite comprehensive. It also meets my requirement to be easily distributed. Like I said, that one probably would have worked for me.

You might also be interested in:


The GNU program source-highlight does tons of languages, or you can save as HTML from the Kate KDE editor.

Use AsciiDoc with Source-highlight. Source-highlight can highlight ~130+ different languages/file formats.


Vim will also let you save syntax highlighted code as HTML. No need to reinvent the wheel...
