A data file metaformat is a set of syntactic and lexical
conventions that is either formally standardized or sufficiently well
established by practice that there are standard service libraries to
handle marshaling and unmarshaling it.
Unix has evolved or adopted metaformats suitable for a wide range
of applications. It is good practice to use one of these (rather than
an idiosyncratic custom format) wherever possible. The benefits begin
with the amount of custom parsing and generation code that you may be
able to avoid writing by using a service library. But the most
important benefit is that developers and even many users will instantly
recognize these formats and feel comfortable with them, which reduces
the friction costs of learning new programs.
In the following discussion, when we refer to “traditional
Unix tools” we mean the combination of grep(1), sed(1), awk(1),
tr(1), and cut(1) for doing text searches and transformations.
Perl and other
scripting languages tend to have good native support for
parsing the line-oriented formats that these tools encourage.
Here, then, are the standard formats that can serve you as models.
DSV stands for Delimiter-Separated
Values. Our first case study in textual metaformats was
the /etc/passwd file, which is a DSV format with
colon as the value separator. Under Unix, colon is the default
separator for DSV formats in which the field values may contain
whitespace.
/etc/passwd format (one record
per line, colon-separated fields) is very traditional under Unix
and frequently used for tabular data. Other classic examples
include the /etc/group file describing security
groups and the /etc/inittab file used to control
startup and shutdown of Unix service programs at different run levels
of the operating system.
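As a minimal sketch of how little code this layout demands, the
following Python fragment splits passwd-style records into named
fields. The sample data, the FIELDS tuple, and the parse_passwd helper
are illustrative assumptions, not part of any system library, and the
sketch deliberately ignores escaping.

```python
# Hypothetical sample data in /etc/passwd layout; not read from a real system.
SAMPLE = """\
root:x:0:0:root:/root:/bin/sh
jdoe:x:1000:1000:Jane Doe:/home/jdoe:/bin/sh
"""

# Conventional names for the seven passwd fields.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")


def parse_passwd(text):
    """Return one dict per non-empty line, keyed by the names in FIELDS."""
    records = []
    for line in text.splitlines():
        if not line:
            continue
        # One record per line, colon-separated fields; no escape handling.
        records.append(dict(zip(FIELDS, line.split(":"))))
    return records


records = parse_passwd(SAMPLE)
```

Note that the fifth field of the second record contains a space, which
is harmless because the separator is a colon rather than whitespace.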
Data files in this style are expected to support inclusion of
colons in the data fields by backslash escaping. More generally,
code that reads them is expected to support record continuation by
ignoring backslash-escaped newlines, and to allow embedding
nonprintable character data by C-style backslash escapes.
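The conventions just described can be sketched as a small
character-at-a-time reader. This is one possible reading of the rules,
not a reference implementation: the parse_dsv name, the sample data,
and the particular set of C-style escapes recognized are assumptions
made for illustration.

```python
# C-style escapes recognized by this sketch; any other character after a
# backslash is taken literally (so "\:" is a colon and "\\" a backslash).
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def parse_dsv(text, sep=":"):
    """Split text into records (lists of fields), honoring backslash escapes."""
    records, fields, buf = [], [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            if nxt == "\n":
                continue  # escaped newline: the record continues on the next line
            buf.append(ESCAPES.get(nxt, nxt))
        elif ch == sep:
            fields.append("".join(buf))
            buf = []
        elif ch == "\n":
            fields.append("".join(buf))
            records.append(fields)
            fields, buf = [], []
        else:
            buf.append(ch)
    if buf or fields:  # final record without a trailing newline
        fields.append("".join(buf))
        records.append(fields)
    return records


# Hypothetical input: an escaped colon, a continuation line, and a \t escape.
sample = "start:09\\:30:daily\nnote:first \\\nsecond:tab\\there\n"
parsed = parse_dsv(sample)
```

There is exactly one special case in the loop (the backslash), and one
action taken when it is seen; the separator and newline branches never
need to know whether they are inside a quoted region.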
This format is most appropriate when the data is tabular,
keyed by a name (in the first field), and records are typically
short (less than 80 characters long). It works well with
traditional Unix tools.
One occasionally sees field separators other than the colon,
such as the pipe character | or even an ASCII NUL. Old-school Unix
practice used to favor tabs, a preference reflected in the defaults
for cut(1) and paste(1); but this has gradually changed as format
designers became aware of the many small irritations that ensue from
the fact that tabs and spaces are not visually distinguishable.
This format is to Unix what CSV (comma-separated value) format
is under Microsoft Windows and elsewhere outside the Unix world.
CSV (fields separated by commas, double quotes used to escape
commas, no continuation lines) is rarely found under Unix.
In fact, the Microsoft version of CSV is a textbook example of
how not to design a textual file format. Its
problems begin with the case in which the separator character (in this
case, a comma) is found inside a field. The Unix way would be to
simply escape the separator with a backslash, and have a double escape
represent a literal backslash. This design gives us a single special case
(the escape character) to check for when parsing the file, and only a
single action when the escape is found (treat the following character
as a literal). The latter conveniently not only handles the separator
character, but gives us a way to handle the escape character and
newlines for free. CSV, on the other hand, encloses the entire field
in double quotes if it contains the separator. If the field contains
double quotes, it must also be enclosed in double quotes, and each
individual double quote in the field must itself be doubled to
indicate that it doesn't end the field.
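To make the contrast concrete, here is a hedged sketch that encodes the
same record both ways. The CSV side uses Python's standard csv module
with its default Excel-style dialect; the to_escaped helper is a
hypothetical illustration of the backslash convention, and for brevity
it does not handle newlines inside fields.

```python
import csv
import io

# A record whose middle field contains both the separator and double quotes.
row = ["widget", 'say "hi", twice', "back\\slash"]


def to_csv(fields):
    """Encode one record with the csv module's default (Excel-style) dialect."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(fields)
    return out.getvalue()


def to_escaped(fields, sep=","):
    """Encode one record by backslash-escaping the escape and the separator."""
    escaped = [f.replace("\\", "\\\\").replace(sep, "\\" + sep) for f in fields]
    return sep.join(escaped) + "\n"


csv_line = to_csv(row)          # widget,"say ""hi"", twice",back\slash
escaped_line = to_escaped(row)  # widget,say "hi"\, twice,back\\slash
```

In the CSV line, a reader must track whether it is inside a quoted
field and must look ahead at every double quote to tell a doubled quote
from a closing one. In the escaped line, the only thing to check for is
the backslash.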
The bad results of proliferating special cases are twofold.
First, the complexity of the parser (and its vulnerability to bugs) is
increased. Second, because the format rules are complex and
underspecified, different implementations diverge in their handling of
edge cases. Sometimes continuation lines are supported, by starting
the last field of the line with an unterminated double quote, but
only in some products! Microsoft has
incompatible versions of CSV files between its own applications, and
in some cases between different versions of the same application
(Excel being the obvious example here).