| |
Unix Programming - Data File Metaformats - Unix Textual File Format Conventions
Unix Textual File Format Conventions
There are long-standing Unix traditions about how textual data
formats ought to look. Most of these derive from one or more of the
standard Unix metaformats we've just described. It is wise to follow these
conventions unless you have strong and specific reasons to do otherwise.
In Chapter10 we
will discuss a different set of conventions used for program
run-control files, but you should notice that it will share some of
these same rules (especially about the lexical level, the rules by
which characters are assembled into tokens).
-
One record per newline-terminated line, if
possible.
This makes it easy to extract records with
text-stream tools. For data interchange with other operating systems,
it's wise to make your file-format parser indifferent to whether
the line ending is LF or CR-LF. It's also conventional to ignore
trailing whitespace in such formats; this protects against common
editor bobbles.
-
Less than 80 characters per line, if possible.
This
makes the format browseable in an ordinary-sized terminal window.
If many records must be longer than 80 characters, consider a
stanza format (see below).
-
Use # as an introducer for comments.
It is good to have a way to embed annotations and comments in
data files. It's best if they're actually part of the file
structure, and so will be preserved by tools that know its format.
For comments that are not preserved during parsing, # is the
conventional start character.
-
Support the backslash convention.
The least
surprising way to support embedding nonprintable control characters
is by parsing C-like backslash escapes — \n for a newline, \r for a carriage return, \t for a tab, \b
for backspace, \f for formfeed,
\e for ASCII escape (27), \nnn or \onnn or
\0nnn for the character with octal
value nnn, \xnn for the character with hexadecimal value
nn, \dnnn for the character with decimal value
nnn, \\
for a literal backslash. A newer convention, but one worth following,
is the use of \unnnn for a hexadecimal
Unicode literal.
-
In one-record-per-line formats, use colon or any run
of whitespace as a field separator.
The colon convention
seems to have originated with the Unix password file. If your fields
must contain instances of the separator(s), use a backslash as the
prefix to escape them.
-
Do not allow the distinction between tab and
whitespace to be significant.
This is a recipe for serious
headaches when the tab settings on your users' editors are different;
more generally, it's confusing to the eye. Using tab alone as a field
separator is especially likely to cause problems; allowing any run of
tabs and spaces to be a field separator, on the other hand, works
well.
-
Favor hex over octal.
Hex-digit pairs and
quads are easier to eyeball-map into bytes and today's 32- and 64-bit
words than octal digits of three bits each; also marginally more
efficient. This rule needs emphasizing because some older Unix tools
such as
od(1)
violate it; that's a legacy from the instruction field sizes in the
machine languages of older PDP minicomputers.
-
For complex records, use a ‘stanza’ format:
multiple lines per record, with a record separator line of %%\n or
%\n.
The separators make useful visual boundaries for human
beings eyeballing the file.
-
In stanza formats, either have one record field per
line or use a record format resembling RFC 822 electronic-mail headers,
with colon-terminated field-name keywords leading fields.
The second choice is appropriate when fields are often either absent
or longer than 80 characters, or when records are sparse (e.g., often
with empty fields).
-
In stanza formats, support line continuation.
When interpreting the file, either discard backslash followed by
whitespace or interpret newline followed by whitespace equivalently to a single
space, so that a long logical line can be folded into short (easily
editable!) physical lines. It's also conventional to ignore
trailing whitespace in these formats; this convention protects against common
editor bobbles.
-
Either include a version number or design the format
as self-describing chunks independent of each other.
If
there is even the faintest possibility that the format will have to be
changed or extended, include a version number so your code can
conditionally do the right thing on all versions. Alternatively,
design the format as self-describing chunks so that you can add new
chunk types without instantly breaking old code.
-
Beware of floating-point round-off problems.
Conversion of floating-point numbers from binary to text format and
back can lose precision, depending on the quality of the conversion
library you are using. If the structure you are
marshaling/unmarshaling contains floating point, you should test the
conversion in both directions. If it looks like conversion in
either direction is subject to roundoff errors, be prepared to dump
the floating-point field as raw binary instead, or a string encoding
thereof. If you're coding in C or some language that has access to
C printf/scanf, the C99 %a specifier may solve this problem.
-
Don't bother compressing or binary-encoding just part of
the file.
See below...
[an error occurred while processing this directive]
|
|