Unix has a long-standing tradition of hosting tools that are
specifically designed to generate code for various special purposes.
The venerable monuments of this tradition, which go back to Version 7
and earlier days, and were actually used to write the original
Portable C Compiler back in the 1970s, are
lex(1)
and
yacc(1).
Their modern, upward-compatible successors are
flex(1)
and
bison(1),
part of the GNU
toolkit and still heavily used today. These programs have set an
example that is carried forward in projects like GNOME's
Glade interface builder.
yacc and
lex are tools for generating language
parsers. We observed in Chapter 8 that your first minilanguage is all
too likely to be an accident rather than a design. That accident is
likely to have a hand-coded parser that costs you far too much
maintenance and debugging time — especially if you have not
realized it is a parser, and have thus failed to properly separate it
from the remainder of your application code. Parser generators are
tools for doing better than an accidental, ad-hoc implementation; they
don't just let you express your grammar specification at a higher
level, they also wall off all the parser's implementation complexity
from the rest of your code.
If you reach a point where you are planning to implement a
minilanguage from scratch, rather than by extending or embedding an
existing scripting language or parsing XML,
yacc and lex
will probably be your most important tools after your
C compiler.
lex and
yacc each generate code for a single
function — respectively, “get a token from the input
stream” and “parse a sequence of tokens to see if it
matches a grammar”. Usually, the
yacc-generated parser function calls a
Lex-generated tokenizer function each time it wants to get another
token. If there are no user-written C callbacks at all in the
yacc-generated parser, all it will do is a
syntax check; the value returned will tell the caller if the input
matched the grammar it was expecting.
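In the pure syntax-check case, the calling program simply invokes the generated entry point and tests its return value. A minimal sketch of such a driver (the message text is illustrative; linking in the lex-generated tokenizer and an error-reporting routine is still required):

    /* driver.c: run the yacc-generated parser over standard input.
     * yyparse() pulls tokens by calling the lex-generated yylex()
     * and returns 0 if the whole input matched the grammar. */
    #include <stdio.h>

    int yyparse(void);          /* generated by yacc */

    int main(void)
    {
        if (yyparse() == 0) {
            puts("input matches the grammar");
            return 0;
        }
        return 1;
    }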
More usually, the user's C code, embedded in the generated
parser, populates some runtime data structures as a side-effect of
parsing the input. If the minilanguage is declarative, your
application can use these runtime data structures directly. If your
design was an imperative minilanguage, the data structures might
include a parse tree which is immediately fed to some kind of
evaluation function.
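For instance, a declarative minilanguage of name = number settings might be handled by a grammar sketched like this, where the action records each assignment as a side effect; the record_setting() helper and the token names are hypothetical, and the tokens themselves would come from a lex-generated tokenizer:

    %{
    /* A sketch, not a complete grammar: each recognized assignment
     * is recorded in application data as a side effect of parsing. */
    void record_setting(char *name, int value);   /* hypothetical helper */
    %}
    %union { int num; char *str; }
    %token <str> NAME
    %token <num> NUMBER
    %%
    settings:   /* empty */
            | settings setting
            ;
    setting:  NAME '=' NUMBER     { record_setting($1, $3); }
            ;
    %%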
yacc has a rather ugly interface: it communicates through exported global variables with the name prefix yy. This is because it predates structs in C; in
fact, yacc predates C itself; the first
implementation was written in C's predecessor B. The crude
though effective algorithm yacc-generated
parsers use to try to recover from parse errors (pop tokens until an
explicit error production is matched) can also lead to
problems, including memory leaks.
    If you are building parse trees, using malloc to make nodes, and
    you start popping things off the stack in error recovery, you don't
    get to recover (free) the storage. In general, Yacc can't do it,
    since it doesn't know enough about what's on the stack. If the yacc
    parser were in C++, it could assume that the values were classes and
    “destruct” them. In “real” compilers, parse tree nodes are generated
    using an arena-based allocator, so the nodes don't leak, but there
    is a logical leak anyway that needs to be thought about to make
    industrial-strength error recovery.

        -- Steve Johnson
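Concretely, recovery points are declared with yacc's reserved error token; when a parse error occurs, the parser discards state and input until it can resume at such a production. A sketch of one such rule, taken out of a larger grammar:

    /* After a malformed statement, skip input up to the next newline,
     * clear the error state with yyerrok, and keep parsing.  Any
     * malloc'd semantic values popped along the way are not freed. */
    statement: error '\n'        { yyerrok; }
            ;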
lex is a lexical analyzer generator.
It's a member of the same functional family as
grep(1)
and
awk(1),
but more powerful because it enables you to arrange for arbitrary C
code to be executed on each match. It accepts a declarative
minilanguage and emits skeleton C code.
A crude but useful way to think about what a
lex-generated tokenizer does is as a sort
of inverse
grep(1).
Where
grep(1)
takes a single regular expression and returns a list of matches in the
incoming data stream, each call to a
lex-generated tokenizer takes a list of
regular expressions and indicates which expression occurs next in the
datastream.
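A sketch of what such a specification looks like (the token names are hypothetical, and would normally be defined in the header that yacc -d generates):

    %{
    #include "y.tab.h"       /* token codes shared with the yacc parser */
    %}
    %%
    [0-9]+                      { return NUMBER; }
    [a-zA-Z_][a-zA-Z0-9_]*      { return NAME; }
    [ \t\n]+                    ;   /* discard whitespace */
    .                           { return yytext[0]; }
    %%

Each call to the generated yylex() consumes just enough input to match one of these patterns and returns the corresponding token code to its caller.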
    Splitting input analysis into tokenizing input and parsing the
    token stream is a useful tactic even if you're not using Yacc and
    Lex and your “tokens” are nothing like the usual ones in a
    compiler. More than once I've found that splitting input handling
    into two levels made the code much simpler and easier to
    understand, despite the complexity added by the split itself.

        -- Henry Spencer
lex was written to automate the task
of generating lexical analyzers (tokenizers) for compilers. It turned
out to have a surprisingly wide range of uses for other kinds of
pattern recognition, and has since been described as “the
Swiss-army knife of Unix programming”.[130]
If you are attacking any kind of pattern-recognition or
state-machine problem in which all the possible input stimuli will fit
in a byte, lex may enable you to generate
code that will be more efficient and reliable than a hand-crafted
state machine.
    John Jarvis at Holmdel [an AT&T laboratory] used lex to find
    faults in circuit boards, by scanning the board, using a
    chain-encoding technique to represent the edges of areas on the
    board, and then using Lex to define patterns that would catch
    common fabrication errors.

        -- Mike Lesk
Most importantly, the lex
specification minilanguage is much higher-level and more
compact than
equivalent handcrafted C. Modules are available to use
flex, the open-source version, with
Perl (find them
with a Web search for “lex perl”), and a work-alike
implementation is part of PLY in
Python.
lex generates tokenizers that are up
to an order of magnitude slower than hand-coded ones. This is
not a good reason to hand-code, however; it's an argument for
prototyping with lex and hand-hacking
only if prototyping reveals an actual bottleneck.
yacc is a parser generator. It, too,
was written to automate part of the job of writing compilers. It
takes as input a grammar specification in a declarative minilanguage
resembling BNF (Backus-Naur Form) with C code associated with each
element of the grammar. It generates code for a parser function
that, when called, accepts text matching the grammar from an input
stream. As each grammar element is recognized, the parser function
runs the associated C code.
The combination of lex and
yacc is very effective for writing language
interpreters of all kinds. Though most Unix programmers never get to
do the kind of general-purpose compiler-building that these tools were
meant to assist, they're extremely useful for writing parsers for
run-control file syntaxes and domain-specific minilanguages.
lex-generated tokenizers are very
fast at recognizing low-level patterns in input streams, but the
regular-expression minilanguage that lex
knows is not good at counting things, or recognizing recursively
nested structures. For parsing those, you want
yacc. On the other hand, while you
theoretically could write a yacc grammar to
do its own token-gathering, the grammar to specify that would be
hugely bloated and the parser extremely slow. For tokenizing input,
you want lex. Thus, these tools are
symbiotic.
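The glue between the two halves is the token-code header emitted by yacc -d (traditionally y.tab.h), which the lex specification includes so that tokenizer and parser agree on the small integer standing for each token. Roughly, with illustrative token names and values:

    /* y.tab.h, as emitted by yacc -d from the grammar's %token
     * declarations (names and code values are illustrative): */
    #define NAME    257
    #define NUMBER  258
    extern YYSTYPE yylval;   /* present when the grammar declares a %union;
                                yylex() deposits each token's value here */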
If you can implement your parser in a higher-level language than
C (which we recommend you do; see Chapter 14 for discussion), then look for equivalent
facilities like Python's PLY (which covers both
lex and yacc)[131]
or Perl's PY
and Parse::Yapp modules, or Java's CUP,[132]
Jack,[133]
or Yacc/M[134]
packages.
As with macro processors, one of the problems with code
generators and preprocessors is that compile-time errors in the
generated code may carry line numbers that are relative to the
generated code (which you don't want to edit) rather than the
generator input (which is where you need to make corrections).
yacc and lex
address this by generating the same #line constructs that the C
preprocessor does; these set the current line number for error
reporting so the numbers will come out right. Any program that
generates C or
C++ should do
likewise.
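For example, a yacc-generated parser brackets each copied fragment of user code with a directive of this form (file name and line number are illustrative):

    #line 42 "grammar.y"
        /* ...user action code copied from line 42 of the grammar... */

The effect is that a compiler diagnostic in the copied action is reported against line 42 of grammar.y, where you can fix it, rather than against some arbitrary line of the generated y.tab.c.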
More generally, well-designed procedural-code generators
should never require the user to hand-alter or even look at the
generated parts. Getting those right is the code generator's
job.
Case Study: The fetchmailrc Grammar
The canonical demonstration example that seems to have appeared
in every lex and
yacc tutorial ever written is a toy
interactive calculator program that parses and evaluates arithmetic
expressions entered by the user. We will spare you yet another
repetition of this cliché; if you are interested, consult the source
code of the
bc(1)
and
dc(1)
calculator implementations from the GNU project, or the paradigm example
‘hoc’[135]
from [Kernighan-Pike84].
Instead, the grammar of fetchmail's
run-control-file parser provides a good medium-sized case study in
lex and yacc
usage. There are a couple of points of interest here.
The lex specification, in
rcfile_l.l, is a very typical implementation of a
shell-like syntax. Note how two complementary rules support either
single or double-quoted strings; this is a good idea in general. The
rules for accepting (possibly signed) integer literals and discarding
comments are also pretty generic.
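The shape of those rules is roughly as follows; this is a simplified sketch rather than fetchmail's exact rules (which also handle escapes), and the STRING/NUMBER token names and yylval members are illustrative:

    \"[^\"]*\"      { yylval.sval = strdup(yytext); return STRING; }
    '[^']*'         { yylval.sval = strdup(yytext); return STRING; }
    -?[0-9]+        { yylval.number = atoi(yytext); return NUMBER; }
    #.*             ;   /* discard comments to end of line */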
The yacc specification, in
rcfile_y.y, is long but straightforward. It does
not perform any fetchmail actions, just
sets bits in a list of internal control blocks. After startup,
fetchmail's normal mode of operation is
just to repeatedly walk that list, using each record to drive a
retrieval session with a remote site.
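In outline, that post-parse main loop is nothing more than the following (the names are illustrative, not fetchmail's actual ones):

    /* Walk the list of control blocks the parser built, driving one
     * retrieval session per record. */
    for (ctl = querylist; ctl != NULL; ctl = ctl->next)
        query_host(ctl);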