Following is an excerpt from Masterminds of Programming, by Federico Biancuzzi and Shane Warden. (Adapted for the web).
The Unix philosophy of many small tools, powerful in their combination, is evident in the AWK programming language. Its inventors (Al Aho, Peter Weinberger, and Brian Kernighan) describe it as a language for syntax-driven pattern matching. Its straightforward syntax and clever selection of useful features make it easy to slice and dice text through one-liners without having to understand parsers and grammars and finite automata. Though its inspiration has spread to general-purpose languages such as Perl, any modern Unix box still has AWK installed and quietly, effectively, working away.
Breeding Little Languages
What hooked you on programming?
Brian Kernighan: I don't really recall any specific event. I didn't even see my first computer until I was about a junior in college, and I didn't really learn to program (in FORTRAN) until a year or so after that. I think that the most fun I had programming was a summer job at Project MAC at MIT in the summer of 1966, where I worked on a program that created a job tape for the brand new GE 645 in the earliest days of Multics. I was writing in MAD, which was much easier and more pleasant than the FORTRAN and COBOL that I had written earlier, and I was using CTSS, the first time-sharing system, which was infinitely easier and more pleasant than punch cards. That was the point where the puzzle-solving aspects of programming became really enjoyable, because the mechanical details didn't get in the way nearly so much.
How do you learn a new language?
Brian: I find it easiest to learn a new language from well-chosen examples that do some task that's close to what I want to do. I copy an example, adapt it to what I need, then expand my knowledge as the specific application drives it. I poke around in enough different languages that after a while they start to blur, and it takes a while to shift gears when I shift from one to another, especially if they are not ones like C that I learned long ago. It's good to have good compilers that complain about suspicious constructions as well as illegal ones; languages with strong type systems like C++ and Java are helpful here, and the options that enforce strict conformance to standards are good, too.
More generally, there's nothing like writing a lot of code, preferably good code that other people use. Next best, though less frequently done, is reading a lot of good code to see how other people write. Finally, breadth of experience helps--each new problem, new language, new tool, and new system helps you get better, and creates links with whatever you know already.
How should a manual for a new programming language be organized?
Brian: A manual should make it easy to find things. That means that the index has to be really good, the tables of things like operators and library functions have to be concise and complete (and easy to find), and the examples should be short and crystal clear.
This is different from a tutorial, which should definitely not be the same as a manual. I think the best approach for a tutorial is a sort of "spiral," in which a small set of useful basic things is presented, but enough to write complete and useful programs. The next rotation of the spiral should cover another level of detail or perhaps alternative ways of saying the same kinds of things, and the examples should still be useful but can be bigger. Then put a good reference manual at the end.
Should examples--even beginner examples--include the error-handling code?
Brian: I'm torn on this. Error-handling code tends to be bulky and very uninteresting and uninstructive, so it often gets in the way of learning and understanding the basic language constructs. At the same time, it's important to remind programmers that errors do happen and that their code has to be able to cope with errors.
My personal preference is to pretty much ignore error handling in the earlier parts of a tutorial, other than to mention that errors can happen, and similarly to ignore errors in most examples in reference manuals unless the point of some section is errors. But this can reinforce the unconscious belief that it's safe to ignore errors, which is always a bad idea.
What did you think of the idea for the Unix manual to cite bugs? Does this practice make sense today, too?
Brian: I liked the BUGS sections, but that was when programs were small and rather simple and it was possible to identify single bugs. The BUGS were often features that were not yet provided or things that were not properly implemented, not bugs in the usual sense of walking off the end of an array or the like. I don't think this would be feasible for most of the kinds of errors one would find in really big modern systems, at least not in a manual. Online bug repositories are a fine tool for managing software development, but it's not likely that they will help ordinary users.
Do current programmers need to be aware of the lessons you collected in your book about programming style, The Elements of Programming Style?
Brian: Yes! The basic ideas of good style, which are fundamentally to write clearly and simply, are just as important now as they were 35 years ago when Bill Plauger and I first wrote about them. The details are different in minor ways, to some extent depending on properties of different languages, but the basics are the same now as then. Simple, straightforward code is just plain easier to work with and less likely to have problems. So as programs get bigger and more complicated, it's even more important to have clean, simple code.
Does the way you can write text influence the way you write software?
Brian: It might. In both text and programs, I tend to work over the material many times until it feels right. There's a lot more of this in prose, of course, but it's the same desire, to have the words or the code be as clear and clean as possible.
How does knowing the problems that software will solve for the user help the developer write better software?
Brian: Unless the developer has a really good idea of what the software is going to be used for, there's a very high probability that the software will turn out badly.
In some fortunate cases, the developer understands the user because the developer is also going to be a user. One of the reasons why the early Unix system was so good, so well suited to the needs of programmers, was that its creators, Ken Thompson and Dennis Ritchie, wanted a system for their own software development; as a result, Unix was just great for programmers writing new programs. The same is true of the C programming language.
If the developers don't know and understand the application well, then it's crucial to get as much user input and experience as possible. It is really instructive to watch new users of your software--within a minute, a typical newcomer will try to do something or make some assumption that you never thought of and your program will make their life harder. But if you don't monitor your users when they first encounter your software, you won't see their problems; if you see them later, they've probably adapted to your bad design.
How can programmers improve their programming?
Brian: Write more code! And then think about the code you wrote and try to rework it to make it better. Get other people to read it too if you can, whether as part of your job or as part of an open source project. It's also helpful to write different kinds of code, and to write in different languages, since that broadens your repertoire of techniques and gives you more ways to approach some programming problem. Read other people's code, for example, to try to add features or fix bugs; that will show you how other people approach problems. Finally, there's nothing like teaching others to program to help you improve your own code.
Everyone knows that debugging is twice as hard as writing the software, so how should debugging be taught?
Brian: I'm not sure that debugging can be taught, but one can certainly try to tell people how to do it systematically. There's a whole chapter on this in The Practice of Programming, which Rob Pike and I wrote to try to explain how to be more effective at debugging.
Debugging is an art, but it's definitely possible to improve your skill as a debugger. New programmers make careless mistakes, like walking off the start or end of an array, or mis-matching types in function calls, or (in C) using the wrong conversion characters in printf and scanf. Fortunately, these are usually easy to catch because they cause very distinctive failures. Even better, they are easy to eliminate as you write the code in the first place, by boundary condition checking, which amounts to thinking about what can go wrong as you write. Bugs usually appear in the code you wrote most recently or that you started to test, so that's a good place to concentrate your efforts.
As bugs get more complicated or subtle, more effort is called for. One effective approach is to "divide and conquer," attempting to eliminate part of the data or part of the program so that the bug is localized in a smaller and smaller region. There's also often a pattern to a bug; the "numerology" of failing inputs or faulty output is often a very big clue to what's going wrong.
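As a toy illustration of the divide-and-conquer approach (everything here is invented for the example: the input file, and a stand-in `fails` function that plays the role of "the program misbehaves on this input"), one can bisect a failing input to localize the line that triggers the bug:

```shell
# Bisect an input file to find the first line that triggers a failure.
# "fails" is a hypothetical stand-in for running the real program and
# checking whether it misbehaves on the given input.
printf 'ok\nok\nBAD\nok\n' > input.txt

fails() { grep -q BAD "$1"; }   # stand-in for "the program fails"

lo=1; hi=$(wc -l < input.txt)
while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    # Test only the lines lo..mid; if the failure shows up there,
    # the trigger is in that half, otherwise it's in the other half.
    sed -n "${lo},${mid}p" input.txt > part.txt
    if fails part.txt; then hi=$mid; else lo=$((mid + 1)); fi
done
echo "first failing line: $lo"
# prints: first failing line: 3
```

This sketch assumes the failure can be reproduced on a subset of the input; when the bug needs context from earlier lines, the halving has to keep a prefix instead.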
The hardest bugs are those where your mental model of the situation is just wrong, so you can't see the problem at all. For these, I prefer to take a break, read a listing, explain the problem to someone else, use a debugger. All of these help me to see the problem a different way, and that's often enough to pin it down. But, sadly, debugging will always be hard. The best way to avoid tough debugging is to write things very carefully in the first place.
How do hardware resources affect the mindset of programmers?
Brian: Having more hardware resources is almost always a good thing--it means, for example, that one doesn't have to worry much about memory management, which used to be an infinite pain and source of errors 20 or 30 years ago (and certainly was when we were writing AWK). It means that one can use potentially inefficient code, especially general-purpose libraries, because runtime is not nearly as much of an issue as it was 20 or 30 years ago. For example, I think nothing today of running AWK over 10 or even 100 MB files, which would have been very unlikely long ago. As processors continue to get faster and memory capacities rise, it's easier to do quick experiments and even write production code in interpreted languages (like AWK) that would not have been feasible a few decades ago. All of this is a great win.
At the same time, the ready availability of resources often leads to very bloated designs and implementations, systems that could be faster and easier to use if a bit more restraint had gone into their design. Modern operating systems certainly have this problem; it seems to take longer and longer for my machines to boot, even though, thanks to Moore's Law, they are noticeably faster than the previous ones. All that software is slowing me down.
What is your opinion on domain-specific languages (DSL)?
Brian: I worked on a lot of what are now most often called domain-specific languages, though I usually called them "little languages," and others refer to "application-specific languages." The idea is that by focusing a language on a specific and usually narrow domain, you can make its syntax match the domain well, so that it's easy to write code to solve problems within that domain. There are lots of examples--SQL would be an instance, and of course AWK itself is a fine example, a language for specifying certain kinds of file processing very easily and compactly.
The big problem with little languages is that they tend to grow. If they are at all useful, people want to apply them more broadly, pushing the envelope of what the original language was meant for. That usually implies adding more features to the language. For instance, a language might originally be purely declarative (no if tests, no loops) and it might have no variables or arithmetic expressions. All of those are useful, however, so they tend to get added. But when they are added, the language grows (it's no longer so little), and gradually the language starts to look like any other general-purpose language, but with different syntax and semantics and sometimes a weaker implementation as well.
Several of the little languages I worked on were for document preparation. The first, with Lorinda Cherry, was called EQN, and was for typesetting mathematical expressions. It was pretty successful, and as our typesetting equipment became more capable, I also did a language for drawing figures and diagrams, which was called PIC. PIC started out only able to draw, but it rapidly became clear that it needed arithmetic expressions to handle computations on coordinates and the like, and it needed variables to store results, and it needed loops to create repetitive structures. All of these were added, but each one was kind of awkward and shaky. In the end, PIC was quite powerful, a Turing-complete language, but one wouldn't want to write a lot of code in it.
How do you define success in terms of your work?
Brian: One of the most rewarding things is to have someone say that they used your language or tool and found that it helped them get their job done better. That's really satisfying. Of course it's sometimes followed by a report of problems or of missing features, but even those are valuable.
In which contexts is AWK still powerful and useful?
Brian: AWK still seems to be best for quick and dirty data analysis: find all the lines that have some property, or summarize some aspect of the data, or make some simple transformation on it. I can often get more done with a couple of lines of AWK than others can with 5 or 10 lines of Perl or Python, and empirically, my code will run almost as fast.
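For concreteness, here are one-liners of the three kinds just mentioned--select, summarize, transform--run on invented sample data (the file name and contents are purely illustrative):

```shell
# Sample data: name, count, score -- one record per line.
printf 'alice 10 3\nbob 20 7\ncarol 30 5\n' > data.txt

# Find all the lines that have some property: field 3 greater than 4.
awk '$3 > 4' data.txt
# -> bob 20 7
#    carol 30 5

# Summarize some aspect of the data: total and mean of field 2.
awk '{ s += $2 } END { print s, s/NR }' data.txt
# -> 60 20

# Make a simple transformation: swap the first two fields.
awk '{ print $2, $1 }' data.txt
# -> 10 alice  (and so on)
```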
I have a collection of small AWK scripts that do things like add up all the fields in all the lines or compute the ranges of all fields (a way to get a quick look at a dataset). I have an AWK program that refills arbitrary text into lines of at most 70 characters that I probably use 100 times a day for cleaning up mail messages and the like. Any of these could be easily written in some other scripting language and would work just as well, but they're easier in AWK.
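Sketches of the same ideas (these are illustrative reconstructions, not the actual scripts from that collection):

```shell
# Add up all the fields in all the lines.
printf '1 2\n3 4\n' | awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s }'
# -> 10

# Compute the range (min to max) of every field.
printf '1 9\n5 2\n' | awk '{
    for (i = 1; i <= NF; i++) {
        v = $i + 0
        if (!(i in min) || v < min[i]) min[i] = v
        if (!(i in max) || v > max[i]) max[i] = v
    }
} END { for (i = 1; i <= NF; i++) print "field " i ": " min[i] " to " max[i] }'
# -> field 1: 1 to 5
#    field 2: 2 to 9

# Refill arbitrary text into lines of at most 70 characters.
awk '{
    for (i = 1; i <= NF; i++) {
        if (length(line) + 1 + length($i) > 70 && line != "") { print line; line = "" }
        line = (line == "") ? $i : line " " $i
    }
} END { if (line != "") print line }' mail.txt
```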
What should people keep in mind when writing AWK programs?
Brian: The language was originally meant for writing programs that were only one or two lines long. If you're writing something big, AWK might well not be the right language, since it has no mechanisms that help with big programs, and some design decisions can lead to hard-to-find bugs--for example, the fact that variables are not declared and are automatically initialized is very convenient for a one-line program, but it means that spelling mistakes and typos are undetectable in a big program.
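The pitfall is easy to demonstrate (the variable names here are invented): a misspelled variable is silently treated as a fresh, auto-initialized one, so AWK reports nothing wrong and simply produces the wrong answer.

```shell
# "totl" is a typo for "tot"; AWK auto-initializes it instead of
# complaining, so the sum is silently lost.
printf '1\n2\n3\n' | awk '{ tot += $1 } END { print totl + 0 }'
# -> 0   (not the intended 6)
```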
If you enjoyed this excerpt, buy a copy of Masterminds of Programming.