Pipes, Redirection, and Filters
After Ken Thompson and Dennis
Ritchie, the
single most important formative figure of early Unix was probably Doug
McIlroy. His
invention of the
pipe
construct reverberated
through the design of Unix, encouraging its nascent do-one-thing-well
philosophy and inspiring most of the later forms of IPC in the Unix
design (in particular, the socket abstraction used for
networking).
Pipes depend on the convention that every program has initially
available to it (at least) two I/O data streams: standard input and
standard output (numeric file descriptors 0 and 1 respectively).
Many programs can be written as
filters
, which read
sequentially from standard input and write only to standard
output.
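To make the convention concrete, here is a trivial filter written as a shell script; the name ‘upcase’ is ours for illustration, not a standard utility:
#!/bin/sh
# upcase: read standard input, write its uppercased form to standard output
tr '[:lower:]' '[:upper:]'
Made executable, it drops into any pipeline or redirection; typing “echo hello | ./upcase” prints HELLO.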
Normally these streams are connected to the user's keyboard and
display, respectively. But Unix shells universally support
redirection
operations which connect these
standard input and output streams to files. Thus, typing
ls >foo
sends the output of the directory lister
ls(1)
to a file named ‘foo’. On the other hand, typing:
wc <foo
causes the word-count utility
wc(1)
to take its standard input from the file ‘foo’, and
deliver a character/word/line count to standard output.
The pipe operation connects the standard output of one program
to the standard input of another. A chain of programs connected in
this way is called a
pipeline
. If we write
ls | wc
we'll see a character/word/line count for the current directory
listing. (In this case, only the line count is really likely to be
useful.)
One favorite pipeline was “bc | speak”—a
talking desk calculator. It knew number names up to a
vigintillion.
    -- Doug McIlroy
It's important to note that all the stages in a pipeline run
concurrently. Each stage waits for input on the output of the
previous one, but no stage has to exit before the next can run. This
property will be important later on when we look at interactive uses
of pipelines, like sending the lengthy output of a command to
more(1).
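The concurrency is easy to observe with a producer that never finishes; the log-file path here is only illustrative:
# grep prints matches as they arrive, while tail -f runs forever;
# both stages are alive at once -- neither waits for the other to exit.
tail -f /var/log/syslog | grep error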
It's easy to underestimate the power of combining pipes and
redirection. As an instructive example, The Unix Shell As
a 4GL [Schaffer-Wolf] shows that with these
facilities as a framework, a handful of simple utilities can be
combined to support creating and manipulating relational databases
expressed as simple textual tables.
The major weakness of pipes is that they are unidirectional.
It's not possible for a pipeline component to pass control information
back up the pipe other than by terminating (in which case the previous
stage will get a SIGPIPE signal on the
next write). Accordingly, the protocol for passing data is simply the
receiver's input format.
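The SIGPIPE behavior is simple to demonstrate from the shell:
# yes(1) writes an endless stream of "y" lines. When head(1) exits after
# printing one line, the next write by yes draws a SIGPIPE and kills it,
# so the pipeline terminates instead of running forever.
yes | head -n 1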
So far, we have discussed anonymous pipes
created by the shell. There is a variant called a
named pipe, which is a special kind of file. If two programs open
the file, one for reading and the other for writing, a named pipe acts
like a pipe-fitting between them. Named pipes are a bit of a
historical relic; they have been largely displaced from use by named
sockets, which we'll discuss below. (For more
on the history of this relic, see the discussion of System V IPC below.)
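Even as a relic, the mechanics take only a few lines of shell to demonstrate:
mkfifo /tmp/fifo          # create the named pipe (a special file)
ls -l / >/tmp/fifo &      # writer: blocks until some reader opens the file
wc -l </tmp/fifo          # reader: the two processes now act like a pipeline
rm /tmp/fifo              # a named pipe is removed like any other file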
Case Study: Piping to a Pager
Pipelines have many uses. For one example, Unix's process lister
ps(1)
lists processes to standard output without caring that a long listing
might scroll off the top of the user's display too quickly for the
user to see it. Unix has another program,
more(1),
which displays its standard input in screen-sized chunks, prompting
for a user keystroke after displaying each screenful.
Thus, if the user types “ps |
more”, piping the output of
ps(1)
to the input of
more(1),
successive page-sized pieces of the list of processes will be
displayed after each keystroke.
The ability to combine programs like this can be extremely
useful. But the real win here is not cute combinations; it's that
because both pipes and
more(1)
exist,
other programs can be simpler
. Pipes mean
that programs like
ls(1)
(and other programs that write to standard output) don't have to grow
their own pagers — and we're saved from a world of a thousand
built-in pagers (each, naturally, with its own divergent look and
feel). Code bloat is avoided and global complexity reduced.
As a bonus, if anyone needs to customize pager behavior, it can
be done in
one
place, by changing
one
program. Indeed, multiple pagers can exist,
and will all be useful with every application that writes to standard
output.
In fact, this has actually happened. On modern Unixes,
more(1)
has been largely replaced by
less(1),
which adds the capability to scroll back in the displayed file rather
than just forward.[70] Because
less(1)
is decoupled from the programs that use it, it's possible to simply
alias ‘more’ to ‘less’ in your shell, set the
environment variable PAGER to ‘less’ (see
Chapter 10), and get
all the benefits of a better pager with all properly-written Unix
programs.
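Concretely, that customization amounts to two lines in a shell startup file:
alias more=less      # old fingers typing 'more' get less instead
export PAGER=less    # programs that consult PAGER will page through less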
Case Study: Making Word Lists
A more interesting example is one in which pipelined programs
cooperate to do some kind of data transformation for which, in less
flexible environments, one would have to write custom code.
Consider the pipeline
tr -c '[:alnum:]' '[\n*]' | sort -iu | grep -v '^[0-9]*$'
The first command translates non-alphanumerics on standard input
to newlines on standard output. The second sorts lines on standard
input and writes the sorted data to standard output, discarding all
but one copy of spans of adjacent identical lines. The third
discards all lines consisting solely of digits. Together, these
generate a sorted wordlist to standard output from text on standard
input.
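A quick demonstration; the output shown assumes the C locale, in which uppercase letters sort before lowercase (other locales may order the lines differently):
$ echo 'The quick brown fox -- the QUICK fox!' | tr -c '[:alnum:]' '[\n*]' | sort -iu | grep -v '^[0-9]*$'
QUICK
The
brown
fox
quick
the
Note that the grep stage also removes the empty lines tr produces for runs of punctuation, since an empty line matches ‘^[0-9]*$’ vacuously.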
Shell source code for the program
pic2graph(1)
ships with the groff suite of
text-formatting tools from the Free Software
Foundation. It translates diagrams
written in the PIC language to bitmap images. Example 7.1 shows the pipeline at the heart of this code.
Example 7.1. The pic2graph pipeline.
(echo ".EQ"; echo $eqndelim; echo ".EN"; echo ".PS";cat;echo ".PE")|\
groff -e -p $groffpic_opts -Tps >${tmp}.ps \
&& convert -crop 0x0 $convert_opts ${tmp}.ps ${tmp}.${format} \
&& cat ${tmp}.${format}
The
pic2graph(1)
implementation illustrates how much one pipeline can do purely by
calling preexisting tools. It starts by massaging its input into an
appropriate form, continues by feeding it through
groff(1)
to produce PostScript, and finishes by converting the PostScript to a
bitmap. All these details are hidden from the user, who simply sees
PIC source go in one end and a bitmap ready for inclusion in a Web
page come out the other.
This is an interesting example because it illustrates how
pipes and filtering can
adapt programs to unexpected uses. The program that interprets PIC,
pic(1),
was originally designed only to be used for embedding diagrams in
typeset documents. Most of the other programs in the toolchain it was
part of are now semiobsolescent. But PIC remains handy for new uses,
such as describing diagrams to be embedded in HTML. It gets a renewed
lease on life because tools like
pic2graph(1)
can bundle together all the machinery needed to convert the output of
pic(1)
into a more modern format.
We'll examine
pic(1)
more closely, as a minilanguage design, in Chapter 8.
Case Study:
bc(1)
and
dc(1)
Part of the classic Unix toolkit dating back to Version 7 is a
pair of calculator programs. The
dc(1)
program is a simple calculator that accepts text lines consisting of
reverse-Polish notation (RPN) on standard input and emits calculated
answers to standard output. The
bc(1)
program accepts a more elaborate infix syntax resembling conventional
algebraic notation; it includes as well the ability to set and read
variables and define functions for elaborate formulas.
While the modern GNU implementation of
bc(1)
is standalone, the classic version passed commands to
dc(1)
over a pipe. In this division of labor,
bc(1)
does variable substitution and function expansion and translates infix
notation into reverse-Polish — but doesn't actually do
calculation itself, instead passing RPN translations of input
expressions to
dc(1)
for evaluation.
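The division of labor is visible from the shell:
$ echo '2 3 4 * + p' | dc    # RPN: computes 2 + (3 * 4); 'p' prints the result
14
$ echo '2 + 3 * 4' | bc      # the same computation in infix notation
14
With the classic pipelined implementation, the second command line would have caused bc(1) to hand something very like the first command's RPN down a pipe to dc(1).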
There are clear advantages to this separation of function. It
means that users get to choose their preferred notation, but the logic
for arbitrary-precision numeric calculation (which is moderately
tricky) does not have to be duplicated. Each of the pair of programs
can be less complex than one calculator with a choice of notations
would be. The two components can be debugged and mentally modeled
independently of each other.
In Chapter 8 we
will reexamine these programs from a slightly different angle, as
examples of domain-specific minilanguages.
Anti-Case Study: Why Isn't fetchmail a Pipeline?
In Unix terms, fetchmail is an
uncomfortably large program that bristles with options. Thinking
about the way mail transport works, one might think it would be
possible to decompose it into a pipeline. Suppose for a moment it were
broken up into several programs: a couple of fetch programs to get
mail from POP3 and IMAP sites, and a local SMTP injector. The
pipeline could pass Unix mailbox format. The present elaborate
fetchmail configuration could be replaced
by a shellscript containing command lines. One could even insert
filters in the pipeline to block spam.
#!/bin/sh
imap jrandom@imap.ccil.org | spamblocker | smtp jrandom
imap jrandom@imap.netaxs.com | smtp jrandom
# pop ed@pop.tems.com | smtp jrandom
This would be very elegant and Unixy. Unfortunately, it can't
work. We touched on the reason earlier; pipelines are
unidirectional.
One of the things the fetcher program
(imap or pop)
would have to do is decide whether to send a delete request for each
message it fetches. In fetchmail's present
organization, it can delay sending that request to the POP or IMAP
server until it knows that the local SMTP listener has accepted
responsibility for the message. The pipelined, small-component
version would lose that property.
Consider, for example, what would happen if the
smtp injector fails because the SMTP
listener reports a disk-full condition. If the fetcher has already
deleted the mail, we lose. This means the fetcher cannot delete mail
until it is notified to do so by the smtp
injector. This in turn raises a host of questions. How would they
communicate? What message, exactly, would the injector pass back?
The global complexity of the resulting system, and its vulnerability
to subtle bugs, would almost certainly be higher than that of a
monolithic program.
Pipelines are a marvelous tool, but not a universal one.