What does it stand for?
CSV is short for Comma Separated Values – though the comma itself isn’t mandatory. In principle, any character can be used as a separator – though of course it’s sensible to use characters that appear only rarely within the values themselves. A popular choice alongside the comma is the semi-colon (;), but out in the wild you would also encounter the vertical bar (|), the at-sign (@), the tab, and other outlandish ideas, including unprintable characters such as the ASCII 7 bell character.
A simple example
Here, the semi-colon has been used as a separator. As you can see, CSV is an incredibly economical format, as each value is separated by just one character. At the end of each record there is typically a line break, which also takes up one character, or at worst two characters. Fixed record can’t compete with that – let alone XML.
Let’s go through each line in detail.
- OH stands for order header here. This is the record type identifier, which lets a piece of software (or a human) recognise what kind of record is coming next.
- 4711 is the order number.
- K0815 is the customer number.
- 05.2013 is the order date, in standard German date format (dd.mm.yyyy).
- LIN is the line item identifier.
- 1 is the line item number.
- S123 is the item number.
- 5 is the quantity.
- 99 is the price per unit.
In CSV, unlike in fixed record, it isn’t advisable to omit line breaks. This is because a very practical rule is applied here: if there are no more values in a record, you can also omit the separators. Let’s assume that in a particular line item, the price per unit might be followed by the name and description of the item:
LIN;1;S123;5;9.99;USB Stick;Memory stick with USB 3.0 and 4 GB capacity
If we wanted to leave these two textual values out, we wouldn’t have to write:
Instead, we could simply leave out the last two semi-colons altogether. If this were a fixed record file, we would be able to skip the line break – but then we would also have to add extra spaces as filler characters. Not exactly a fair trade…
Record type identifiers
An individual message is typically made up of various parts. These might include a header (as in the example above), and often a trailer record at the end, as well as information on individual shipments, call-offs etc., details on line items, and so on. Typically, the record type will appear at the very beginning. This is extremely practical, as we can then apply just one simple rule:
Look at the first value in each line to find out what record type is coming next. With the help of the format definition, we now know what values will appear in which order. We simply chop each line up by its separators, and there we have it. A very easily digestible format.
Rules in practice
As mentioned at the beginning, it’s important to choose a separator character that only rarely appears in the values. “Only rarely” isn’t the same as “never”, however. So what do we do when the separator does turn up in a value?
LIN;1;S123;5;9.99;USB Stick;“Memory stick; USB 3.0; 4 GB of capacity”
Here, the semi-colon also appears twice in the item description. But you can already see how we resolve this: we surround the value with quotation marks. The use of the “character for this is itself very widespread, but not necessarily mandatory. The most commonly used alternative is the apostrophe (‘). Within a given file, the quotation character is just as fixed as the separator.
That brings us straight on to the next problem: What if the item description itself contains quotation marks (or another quotation character)? To get round this, the character is typically “escaped”, or invalidated. This is done using the backslash:
“This text contains \”quotation marks\” partway through.”
A \” is thus an invalidated “, which is interpreted merely as the character itself, and not as a quotation character.
And to add to the fun: If the backslash (\) itself appears in the text, it escapes itself by adding an additional backslash: \\One more thing: In order to make this compact but highly unstructured format easier for humans to read, a kind of header is often provided on the first line, containing only the names of the values that will appear in the following lines. This is particularly common in data exports with just one record type that do not have a record type identifier at the beginning. As an example:
In these cases, if we only want to process the actual data then we simply ignore the first line.
When humans process data, they seldom do so directly in CSV format.
Instead, they typically use Excel or an open-source equivalent (such as OO Calc). In Excel, you can export your tables at any time as CSV files, and if you double click on a .csv file in Windows then Excel will usually be the program assigned to open it. Excel does one thing differently when exporting to CSV: Instead of invalidating quotation marks with an escape character such as a \, it simply doubles them:
“This text contains “”quotation marks”” partway through.”
This is completely allowed!
So you should expect to receive files in this format too, from time to time. But your converter software should be able to cope.
As for the other variations:
CSV is used particularly frequently for compact exports or even imports into databases (for example). Because these export/import routines are generally bashed out very quickly, you will encounter a wide range of non-standard solutions. One developer might think that her values will never contain anything that needs to be wrapped in “ “, and as such, her routines will strictly refuse to read any files with “ “. Others will be excessively cautious, and will wrap even purely numerical values in “ “. You should therefore make sure that the software you are using to process CSV files is flexible not only in terms of the choice of special characters (separators, quotation marks and escape characters), but also when it comes to the rules for “quoting”.
Unlike fixed record, there are no major, pre-defined messaging standards that are based on CSV. As mentioned in the previous section, CSV is used especially often for quick and straightforward data imports and exports. This is because it is wonderfully compact, has the lowest overhead of any formats (although EDIFACT and X12 can compete here too, they are much more complex in terms of logic), and is incredibly simple.