The Importance of Being Textual
Pipes and sockets will pass binary data as well as text. But
there are good reasons the examples we'll see in Chapter 7 are textual: reasons that hark back
to Doug McIlroy's advice quoted in Chapter 1. Text streams are a valuable universal
format because they're easy for human beings to read, write, and edit
without specialized tools. These formats are (or can be designed to
be) transparent.
Also, the very limitations of text streams help enforce
encapsulation. By discouraging elaborate representations with rich,
densely encoded structure, text streams discourage programs from
being promiscuous with each other about their internal
states. We'll return to this point at the end of Chapter 7 when we discuss RPC.
When you feel the urge to design a complex binary file format,
or a complex binary application protocol, it is generally wise to
lie down until the feeling passes. If performance is what you're
worried about, implementing compression on the text protocol stream
either at some level below or above the application protocol will
give you a cleaner and perhaps better-performing design than a
binary protocol (text compresses well, and quickly).
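The claim that text compresses well is easy to check. Here is a small
illustrative sketch (mine, not from the text) that compresses a
hypothetical textual protocol transcript with Python's standard
zlib module; the repetitive keywords and syntax of such protocols are
exactly what general-purpose compressors exploit:

```python
import zlib

# A made-up textual protocol transcript: repeated verbs, field names,
# and framing make the stream highly compressible.
transcript = "".join(
    f"SET key{i % 10} value{i}\r\nOK\r\n" for i in range(1000)
).encode("ascii")

# Compress below the application protocol, as the text suggests.
compressed = zlib.compress(transcript, 6)

print(f"{len(transcript)} bytes raw, {len(compressed)} bytes compressed")
```

On repetitive protocol traffic like this, the compressed stream is a
small fraction of the original, narrowing most of the density gap
between a text encoding and a hand-packed binary one.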
A bad example of binary formats in Unix history was the way
device-independent troff read a binary file
containing device information, supposedly for speed. The initial
implementation generated that binary file from a text description in a
somewhat unportable way. Faced with a need to port ditroff quickly
to a new machine, rather than reinvent the binary goo, I ripped it out
and just had ditroff read the text file. With carefully crafted
file-reading code, the speed penalty was negligible.
-- Henry Spencer
Designing a textual protocol tends to future-proof your system.
One specific reason is that ranges on numeric fields aren't implied by
the format itself. Binary formats usually specify the number of bits
allocated to a given value, and extending them is difficult. For
example, IPv4 only allows 32 bits for an address. To extend address
size to 128 bits (as done by IPv6) requires a major
revamping.[51] In contrast, if you need a larger value
in a text format, just write it. It may be that a given program can't
receive values in that range, but it's usually easier to modify the
program than to modify all the data stored in that format.
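To make the range argument concrete, here is a hedged Python sketch
(my illustration; the "addr" field name is hypothetical): a decimal
field in a text record grows past 32 bits with no format revision,
while a fixed-width binary field simply cannot hold the value:

```python
import struct

# Text format: the field is just digits, as wide as the value needs.
record = "addr:340282366920938463463374607431768211455"  # 2**128 - 1
value = int(record.split(":", 1)[1])
assert value == 2**128 - 1  # parses fine; no format change required

# Binary format: a 32-bit unsigned field hard-caps the range.
try:
    struct.pack(">I", value)
except struct.error:
    print("value does not fit in a 32-bit field")
```

The text reader may still need updating to cope with the larger
values, but the stored data itself never has to be converted.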
The only good justification for a binary protocol is if you're
going to be manipulating large enough data sets that you're genuinely
worried about getting the most bit-density out of your media, or if
you're very concerned about the time or instruction budget required to
interpret the data into an in-core structure. Formats for large
images and multimedia are sometimes an example of the former, and
network protocols with hard latency requirements sometimes an
example of the latter.
The reciprocal problem with SMTP or HTTP-like text protocols is
that they tend to be expensive in bandwidth and slow to parse. The
smallest X request is 4 bytes; the smallest HTTP request is about 100
bytes. X requests, including amortized overhead of transport, can be
executed on the order of 100 instructions; at one point, an Apache
[web server] developer proudly indicated they were down to 7000
instructions.
graphics, bandwidth becomes everything on output; hardware is designed
such that these days the graphics-card bus is
the
bottleneck for small operations, so any protocol had better be very
tight if it is not to be a worse bottleneck. This is the extreme
case.
-- Jim Gettys
These concerns are valid in other extreme cases as well as in X
— for example, in the design of graphics file formats intended
to hold very large images. But they are usually just another case of
premature-optimization
fever. Textual formats don't necessarily have much lower bit density
than binary ones; they do after all use seven out of eight bits per
byte. And what you gain by not having to parse text, you generally
lose the first time you need to generate a test load, or to eyeball a
program-generated example of your format and figure out what's in
there.
In addition, the kind of thinking that goes into designing tight
binary formats tends to fall down on making them cleanly extensible.
The X designers experienced this:
Against the current X framework is the fact we didn't design enough of
a structure to make it easier to ignore trivial extensions to the protocol;
we can do this some of the time, but a bit better framework would have been
good.
-- Jim Gettys
When you think you have an extreme case that justifies a binary
file format or protocol, you need to think very carefully about
extensibility
and leaving room in the design for growth.
Case Study: Unix Password File Format
On many operating systems, the per-user data required to
validate logins and start a user's session is an opaque binary
database. Under Unix, by contrast, it's a text file with records one
per line and colon-separated fields.
Example 5.1 consists of some randomly chosen
example lines:
Example 5.1. Password file example.
games:*:12:100:games:/usr/games:
gopher:*:13:30:gopher:/usr/lib/gopher-data:
ftp:*:14:50:FTP User:/home/ftp:
esr:0SmFuPnH5JlNs:23:23:Eric S. Raymond:/home/esr:
nobody:*:99:99:Nobody:/:
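Records in this format can be picked apart with one string split per
line. A minimal Python sketch (mine, not from the text), using two of
the example lines above and the traditional seven-field layout:

```python
# Parse colon-separated passwd-style records into dictionaries.
FIELDS = ("name", "passwd", "uid", "gid", "gecos", "home", "shell")

def parse_passwd(text):
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rec = dict(zip(FIELDS, line.split(":")))
        rec["uid"] = int(rec["uid"])   # numeric fields are plain decimal
        rec["gid"] = int(rec["gid"])
        records.append(rec)
    return records

sample = "ftp:*:14:50:FTP User:/home/ftp:\nnobody:*:99:99:Nobody:/:\n"
users = parse_passwd(sample)
print(users[0]["gecos"])  # field position, not tags, conveys structure
```

Note that structure is carried entirely by field position; the parser
needs nothing but split and a fixed list of field names.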
Without even knowing anything about the semantics of the fields,
we can notice that it would be hard to pack the data much tighter in
a binary format. The colon sentinel characters would have to have
functional equivalents taking at least as much space (usually
either count bytes or NULs). The per-user records would either have
to have terminators (which could hardly be shorter than a single
newline) or else be wastefully padded out to a fixed length.
Actually the prospects for saving space through binary encoding
pretty much vanish if you know the actual semantics of the data. The
numeric user ID (3rd) and group ID (4th) fields are integers, thus on
most machines a binary representation would be at least 4 bytes, and
longer than the text for values up to 999. But let's agree to ignore
this for now and suppose, as a best case, that the numeric fields have
a 0-255 range.
We could tighten up the numeric fields (3rd and 4th) by
collapsing the numerics to single bytes, and the password strings
(2nd) to an 8-bit encoding. On this example, that would give about an
8% size decrease.
That 8% of putative inefficiency buys us a lot. It avoids
putting an arbitrary limit on the range of the numeric fields. It
gives us the ability to modify the password file with any old text
editor of our choice, rather than having to build a specialized tool
to edit a binary format (though in the case of the password file
itself, we have to be extra careful about concurrent edits). And it
gives us the ability to do ad-hoc searches and filters and reports on
the user account information with text-stream tools such as
grep(1).
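The same property can be exercised programmatically. Here is a hedged
Python sketch of the kind of ad-hoc report a grep(1) or awk(1)
pipeline would do on the text format; "UID below 20" is an arbitrary
cutoff chosen for illustration:

```python
# Ad-hoc filter over password-file text: list account names whose
# numeric UID (third colon-separated field) is below 20.
sample = """\
games:*:12:100:games:/usr/games:
ftp:*:14:50:FTP User:/home/ftp:
esr:0SmFuPnH5JlNs:23:23:Eric S. Raymond:/home/esr:
nobody:*:99:99:Nobody:/:
"""

low_uid_accounts = [
    line.split(":")[0]
    for line in sample.splitlines()
    if int(line.split(":")[2]) < 20
]
print(low_uid_accounts)
```

No schema, record API, or specialized dump tool is involved; the
report is a few lines of throwaway string handling.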
We do have to be a bit careful about not embedding a colon in
any of the textual fields. Good practice is to tell the file write
code to precede embedded colons with an escape character, and then to
tell the file read code to interpret it. Unix tradition favors
backslash for this use.
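A minimal sketch of that escaping convention in Python (my
illustration; I assume backslash escapes only the colon and the
backslash itself, and the sample record is invented):

```python
def escape_field(s):
    # Writer side: protect the escape character first, then the separator.
    return s.replace("\\", "\\\\").replace(":", "\\:")

def split_record(line):
    # Reader side: split on unescaped colons, honoring backslash escapes.
    fields, buf, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            buf.append(line[i + 1])  # take the next char literally
            i += 2
        elif c == ":":
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(c)
            i += 1
    fields.append("".join(buf))
    return fields

gecos = escape_field("Raymond: hacker")
line = f"esr:x:23:23:{gecos}:/home/esr:/bin/sh"
print(split_record(line)[4])
```

The writer and reader must agree on the convention, but the file stays
editable by hand: an escaped colon is still visibly a colon.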
The fact that structural information is conveyed by field
position rather than an explicit tag makes this format faster
to read and write, but a bit rigid. If the set of properties
associated with a key is expected to change with any frequency,
one of the tagged formats described below might be a better choice.
Economy is not a major issue with password files to begin with,
as they're normally read seldom[52] and infrequently modified. Interoperability
is not an issue, since various data in the file (notably user and
group numbers) are not portable off the originating machine. For
password files, it's therefore quite clear that going where the
transparency
criterion leads was the right thing.