Unix Programming - Data File Metaformats

Data File Metaformats

A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently well established by practice that there are standard service libraries to handle marshaling and unmarshaling it.

Unix has evolved or adopted metaformats suitable for a wide range of applications. It is good practice to use one of these (rather than an idiosyncratic custom format) wherever possible. The benefits begin with the amount of custom parsing and generation code that you may be able to avoid writing by using a service library. But the most important benefit is that developers and even many users will instantly recognize these formats and feel comfortable with them, which reduces the friction costs of learning new programs.

In the following discussion, when we refer to “traditional Unix tools” we are intending the combination of grep(1), sed(1), awk(1), tr(1), and cut(1) for doing text searches and transformations. Perl and other scripting languages tend to have good native support for parsing the line-oriented formats that these tools encourage.

Here, then, are the standard formats that can serve you as models.

DSV Style

DSV stands for Delimiter-Separated Values. Our first case study in textual metaformats was the /etc/passwd file, which is a DSV format with colon as the value separator. Under Unix, colon is the default separator for DSV formats in which the field values may contain whitespace.

/etc/passwd format (one record per line, colon-separated fields) is very traditional under Unix and frequently used for tabular data. Other classic examples include the /etc/group file describing security groups and the /etc/inittab file used to control startup and shutdown of Unix service programs at different run levels of the operating system.

Data files in this style are expected to support inclusion of colons in the data fields by backslash escaping. More generally, code that reads them is expected to support record continuation by ignoring backslash-escaped newlines, and to allow embedding nonprintable character data by C-style backslash escapes.

This format is most appropriate when the data is tabular, keyed by a name (in the first field), and records are typically short (less than 80 characters long). It works well with traditional Unix tools.

One occasionally sees field separators other than the colon, such as the pipe character | or even an ASCII NUL. Old-school Unix practice used to favor tabs, a preference reflected in the defaults for cut(1) and paste(1); but this has gradually changed as format designers became aware of the many small irritations that ensue from the fact that tabs and spaces are not visually distinguishable.

This format is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.

In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).

[an error occurred while processing this directive]

The Art of Unix Programming
Prev	Home	Next