Good Protocols Make GoodPractice
It's a well-known fact that computing devices such as the abacus
were invented thousands of years ago. But it's not well known that the
first use of a common computer protocol occurred in the Old
Testament. This, of course, was when Moses aborted the Egyptians'
process with a control-sea.
--
Tom Galloway
rec.arts.comics, February 1992
In this chapter, we'll look at what the Unix tradition has to
tell us about two different kinds of design that are closely related:
the design of file formats for retaining application data in permanent
storage, and the design of application protocols for passing data
and commands between cooperating programs, possibly over a
network.
What unifies these two kinds of design is that they both involve
the serialization of in-memory data structures. For the internal
operation of computer programs, the most convenient representation of
a complex data structure is one in which all fields have the machine's
native data format (e.g. two's-complement binary for integers) and all
pointers are actual memory addresses (as opposed, say, to being named
references). But these representations are not well suited to storage
and transmission; memory addresses in the data structure lose their
meaning outside memory, and emitting raw native data formats causes
interoperability problems passing data between machines with different
conventions (big- vs. little-endian, say, or 32-bit vs. 64-bit).
For transmission and storage, the traversable, quasi-spatial
layout of data structures like linked lists needs to be flattened or
serialized into a byte-stream representation from which the structure
can later be recovered. The serialization (save) operation is
sometimes called marshaling and its inverse
(load) operation unmarshaling. These terms
are usually applied with respect to objects in an
OO
language like C++ or Python or Java, but could be used with equal justice
of operations like loading a graphics file into the internal storage
of a graphics editor and saving it out after modifications.
A significant percentage of what C and C++ programmers maintain is ad-hoc code for
marshaling and unmarshaling operations — even when the
serialized representation chosen is as simple as a binary structure
dump (a common technique under non-Unix environments). Modern
languages like Python and Java tend to have built-in unmarshal and
marshal functions that can be applied to any object or byte-stream
representing an object, and that reduce this labor substantially.
But these nave methods are often unsatisfactory for
various reasons, including both the machine-interoperability problems
we mentioned above and the negative trait of being opaque to other
tools. When the application is a network protocol, economy may demand
that an internal data structure (such as, say, a message with source
and destination addresses) be serialized not into a single blob of
data but into a series of attempted transactions or messages which the
receiving machine may reject (so that, for example, a large message can
be rejected if the destination address is invalid).
Interoperability,
transparency,
extensibility,
and storage or transaction economy: these are the important themes in
designing file formats and application protocols. Interoperability
and transparency demand that we focus such designs on clean data
representations, rather than putting convenience of implementation or
highest possible performance first. Extensibility also favors textual
protocols, since binary ones are often harder to extend or subset
cleanly. Transaction economy sometimes pushes in the opposite
direction — but we shall see that putting that criterion first
is a form of premature
optimization
that it is often wise to resist.
Finally, we must note a difference between data file formats and
the run-control files that are often used to set the startup options
of Unix programs. The most basic difference is that (with sporadic
exceptions like GNU Emacs's configuration interface) programs don't
normally modify their own run-control files — the information
flow is one-way, from file read at startup time to application
settings. Data-file formats, on the other hand, associate properties
with named resources and are both read and written by their
applications. Configuration files are generally hand-edited and small,
whereas data files are program-generated and can become arbitrarily
large.
Historically, Unix has related but different sets of conventions
for these two kinds of representation. The conventions for run
control files are surveyed in Chapter10; only conventions for data files are
examined in this chapter.