Unix has a long-standing tradition of hosting tools that are
specifically designed to generate code for various special purposes.
The venerable monuments of this tradition, which go back to Version 7
and earlier days, and were actually used to write the original
Portable C Compiler back in the 1970s, are
lex(1)
and
yacc(1).
Their modern, upward-compatible successors are
flex(1)
and
bison(1),
part of the GNU
toolkit and still heavily used today. These programs have set an
example that is carried forward in projects like GNOME's
Glade interface builder.
yacc and
lex are tools for generating language
parsers. We observed in Chapter 8 that your first minilanguage is all
too likely to be an accident rather than a design. That accident is
likely to have a hand-coded parser that costs you far too much
maintenance and debugging time — especially if you have not
realized it is a parser, and have thus failed to properly separate it
from the remainder of your application code. Parser generators are
tools for doing better than an accidental, ad-hoc implementation; they
don't just let you express your grammar specification at a higher
level, they also wall off all the parser's implementation complexity
from the rest of your code.
If you reach a point where you are planning to implement a
minilanguage from scratch, rather than by extending or embedding an
existing scripting language or parsing XML,
yacc and lex
will probably be your most important tools after your
C compiler.
lex and
yacc each generate code for a single
function — respectively, “get a token from the input
stream” and “parse a sequence of tokens to see if it
matches a grammar”. Usually, the
yacc-generated parser function calls a
Lex-generated tokenizer function each time it wants to get another
token. If there are no user-written C callbacks at all in the
yacc-generated parser, all it will do is a
syntax check; the value returned will tell the caller if the input
matched the grammar it was expecting.
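In the pure syntax-check case, the calling program simply invokes the generated entry point and tests its return value. A minimal sketch of such a driver (the message text is illustrative; linking in the lex-generated tokenizer and an error-reporting routine is still required):

    /* driver.c: run the yacc-generated parser over standard input.
     * yyparse() pulls tokens by calling the lex-generated yylex()
     * and returns 0 if the whole input matched the grammar. */
    #include <stdio.h>

    int yyparse(void);          /* generated by yacc */

    int main(void)
    {
        if (yyparse() == 0) {
            puts("input matches the grammar");
            return 0;
        }
        return 1;
    }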
More usually, the user's C code, embedded in the generated
parser, populates some runtime data structures as a side-effect of
parsing the input. If the minilanguage is declarative, your
application can use these runtime data structures directly. If your
design was an imperative minilanguage, the data structures might
include a parse tree which is immediately fed to some kind of
evaluation function.
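For instance, a declarative minilanguage of name = number settings might be handled by a grammar sketched like this, where the action records each assignment as a side effect; the record_setting() helper and the token names are hypothetical, and the tokens themselves would come from a lex-generated tokenizer:

    %{
    /* A sketch, not a complete grammar: each recognized assignment
     * is recorded in application data as a side effect of parsing. */
    void record_setting(char *name, int value);   /* hypothetical helper */
    %}
    %union { int num; char *str; }
    %token <str> NAME
    %token <num> NUMBER
    %%
    settings:   /* empty */
            | settings setting
            ;
    setting:  NAME '=' NUMBER     { record_setting($1, $3); }
            ;
    %%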
yacc has a rather ugly interface: it communicates through exported global variables with the name prefix yy. This is because it predates structs in C; in
fact, yacc predates C itself; the first
implementation was written in C's predecessor B. The crude
though effective algorithm yacc-generated
parsers use to try to recover from parse errors (pop tokens until an
explicit error production is matched) can also lead to
problems, including memory leaks.
    If you are building parse trees, using malloc to make nodes, and
    you start popping things off the stack in error recovery, you don't
    get to recover (free) the storage. In general, Yacc can't do it,
    since it doesn't know enough about what's on the stack. If the yacc
    parser were in C++, it could assume that the values were classes and
    “destruct” them. In “real” compilers, parse tree nodes are generated
    using an arena-based allocator, so the nodes don't leak, but there
    is a logical leak anyway that needs to be thought about to make
    industrial-strength error recovery.

        -- Steve Johnson
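Concretely, recovery points are declared with yacc's reserved error token; when a parse error occurs, the parser discards state and input until it can resume at such a production. A sketch of one such rule, taken out of a larger grammar:

    /* After a malformed statement, skip input up to the next newline,
     * clear the error state with yyerrok, and keep parsing.  Any
     * malloc'd semantic values popped along the way are not freed. */
    statement: error '\n'        { yyerrok; }
            ;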
lex is a lexical analyzer generator.
It's a member of the same functional family as
grep(1)
and
awk(1),
but more powerful because it enables you to arrange for arbitrary C
code to be executed on each match. It accepts a declarative
minilanguage and emits skeleton C code.
A crude but useful way to think about what a
lex-generated tokenizer does is as a sort
of inverse
grep(1).
Where
grep(1)
takes a single regular expression and returns a list of matches in the
incoming data stream, each call to a
lex-generated tokenizer takes a list of
regular expressions and indicates which expression occurs next in the
datastream.
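A sketch of what such a specification looks like (the token names are hypothetical, and would normally be defined in the header that yacc -d generates):

    %{
    #include "y.tab.h"       /* token codes shared with the yacc parser */
    %}
    %%
    [0-9]+                      { return NUMBER; }
    [a-zA-Z_][a-zA-Z0-9_]*      { return NAME; }
    [ \t\n]+                    ;   /* discard whitespace */
    .                           { return yytext[0]; }
    %%

Each call to the generated yylex() consumes just enough input to match one of these patterns and returns the corresponding token code to its caller.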
    Splitting input analysis into tokenizing input and parsing the
    token stream is a useful tactic even if you're not using Yacc and
    Lex and your “tokens” are nothing like the usual ones in a
    compiler. More than once I've found that splitting input handling
    into two levels made the code much simpler and easier to
    understand, despite the complexity added by the split itself.

        -- Henry Spencer
lex was written to automate the task
of generating lexical analyzers (tokenizers) for compilers. It turned
out to have a surprisingly wide range of uses for other kinds of
pattern recognition, and has since been described as “the
Swiss-army knife of Unix programming”.[130]
If you are attacking any kind of pattern-recognition or
state-machine problem in which all the possible input stimuli will fit
in a byte, lex may enable you to generate
code that will be more efficient and reliable than a hand-crafted
state machine.
    John Jarvis at Holmdel [an AT&T laboratory] used lex to find
    faults in circuit boards, by scanning the board, using a
    chain-encoding technique to represent the edges of areas on the
    board, and then using Lex to define patterns that would catch
    common fabrication errors.

        -- Mike Lesk
Most importantly, the lex
specification minilanguage is much higher-level and more
compact than
equivalent handcrafted C. Modules are available to use
flex, the open-source version, with
Perl (find them
with a Web search for “lex perl”), and a work-alike
implementation is part of PLY in
Python.
lex generates tokenizers that are up
to an order of magnitude slower than hand-coded ones. This is
not a good reason to hand-code, however; it's an argument for
prototyping with lex and hand-hacking
only if prototyping reveals an actual bottleneck.
yacc is a parser generator. It, too,
was written to automate part of the job of writing compilers. It
takes as input a grammar specification in a declarative minilanguage
resembling BNF (Backus-Naur Form) with C code associated with each
element of the grammar. It generates code for a parser function
that, when called, accepts text matching the grammar from an input
stream. As each grammar element is recognized, the parser function
runs the associated C code.
The combination of lex and
yacc is very effective for writing language
interpreters of all kinds. Though most Unix programmers never get to
do the kind of general-purpose compiler-building that these tools were
meant to assist, they're extremely useful for writing parsers for
run-control file syntaxes and domain-specific minilanguages.
lex-generated tokenizers are very
fast at recognizing low-level patterns in input streams, but the
regular-expression minilanguage that lex
knows is not good at counting things, or recognizing recursively
nested structures. For parsing those, you want
yacc. On the other hand, while you
theoretically could write a yacc grammar to
do its own token-gathering, the grammar to specify that would be
hugely bloated and the parser extremely slow. For tokenizing input,
you want lex. Thus, these tools are
symbiotic.
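The glue between the two halves is the token-code header emitted by yacc -d (traditionally y.tab.h), which the lex specification includes so that tokenizer and parser agree on the small integer standing for each token. Roughly, with illustrative token names and values:

    /* y.tab.h, as emitted by yacc -d from the grammar's %token
     * declarations (names and code values are illustrative): */
    #define NAME    257
    #define NUMBER  258
    extern YYSTYPE yylval;   /* present when the grammar declares a %union;
                                yylex() deposits each token's value here */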
If you can implement your parser in a higher-level language than
C (which we recommend you do; see Chapter 14 for discussion), then look for equivalent
facilities like Python's PLY (which covers both
lex and yacc)[131]
or Perl's PY
and Parse::Yapp modules, or Java's CUP,[132]
Jack,[133]
or Yacc/M[134]
packages.
As with macro processors, one of the problems with code
generators and preprocessors is that compile-time errors in the
generated code may carry line numbers that are relative to the
generated code (which you don't want to edit) rather than the
generator input (which is where you need to make corrections).
yacc and lex
address this by generating the same #line constructs that the C
preprocessor does; these set the current line number for error
reporting so the numbers will come out right. Any program that
generates C or
C++ should do
likewise.
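For example, a yacc-generated parser brackets each copied fragment of user code with a directive of this form (file name and line number are illustrative):

    #line 42 "grammar.y"
        /* ...user action code copied from line 42 of the grammar... */

The effect is that a compiler diagnostic in the copied action is reported against line 42 of grammar.y, where you can fix it, rather than against some arbitrary line of the generated y.tab.c.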
More generally, well-designed procedural-code generators
should never require the user to hand-alter or even look at the
generated parts. Getting those right is the code generator's
job.
Case Study: The fetchmailrc Grammar
The canonical demonstration example that seems to have appeared
in every lex and
yacc tutorial ever written is a toy
interactive calculator program that parses and evaluates arithmetic
expressions entered by the user. We will spare you yet another
repetition of this cliché; if you are interested, consult the source
code of the
bc(1)
and
dc(1)
calculator implementations from the GNU project, or the paradigm example
‘hoc’[135]
from [Kernighan-Pike84].
Instead, the grammar of fetchmail's
run-control-file parser provides a good medium-sized case study in
lex and yacc
usage. There are a couple of points of interest here.
The lex specification, in
rcfile_l.l, is a very typical implementation of a
shell-like syntax. Note how two complementary rules support either
single or double-quoted strings; this is a good idea in general. The
rules for accepting (possibly signed) integer literals and discarding
comments are also pretty generic.
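The shape of those rules is roughly as follows; this is a simplified sketch rather than fetchmail's exact rules (which also handle escapes), and the STRING/NUMBER token names and yylval members are illustrative:

    \"[^\"]*\"      { yylval.sval = strdup(yytext); return STRING; }
    '[^']*'         { yylval.sval = strdup(yytext); return STRING; }
    -?[0-9]+        { yylval.number = atoi(yytext); return NUMBER; }
    #.*             ;   /* discard comments to end of line */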
The yacc specification, in
rcfile_y.y, is long but straightforward. It does
not perform any fetchmail actions, just
sets bits in a list of internal control blocks. After startup,
fetchmail's normal mode of operation is
just to repeatedly walk that list, using each record to drive a
retrieval session with a remote site.
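In outline, that post-parse main loop is nothing more than the following (the names are illustrative, not fetchmail's actual ones):

    /* Walk the list of control blocks the parser built, driving one
     * retrieval session per record. */
    for (ctl = querylist; ctl != NULL; ctl = ctl->next)
        query_host(ctl);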