The Problem
When importing and exporting data between databases, spreadsheets, and other programs, we often fall back on one of the well-known file formats like .csv, .txt, .xml or .json. While these formats are useful, they pose problems like having to escape characters in the data that collide with the delimiters. What we need is a simple, reliable format that protects our text from those collisions.
What’s old is new again
If you look at the basic ASCII character set, you'll see a couple of interesting non-printable characters. They were originally used to help with moving serialized data between computers and storage media. I'm not sure why their use fell off or why our developer tooling hasn't done much to help us work with them. The characters in question are the File Separator (FS, decimal 28), Group Separator (GS, 29), Record Separator (RS, 30), and Unit Separator (US, 31). Let's throw in the End of Transmission character (EOT, 4) as well.
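In Python terms, those characters look like this (just a sketch; the constant names are mine, but the code points are the standard ASCII values):

# ASCII control characters proposed here as data delimiters
US = chr(31)   # Unit Separator: between fields (columns)
RS = chr(30)   # Record Separator: between records (rows)
GS = chr(29)   # Group Separator: between groups (tables/worksheets)
FS = chr(28)   # File Separator: between files (databases)
EOT = chr(4)   # End of Transmission: marks the end of the data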
How it could work
Imagine a csv file with column headings:
col1,col2,col3,col4
A,B,C,D
1,2,3,4
Now, think of the same file, but with unit separators where the commas used to be and record separators where the carriage returns are. To help with readability, I added some carriage returns.
col1uscol2uscol3uscol4rs
AusBusCusDrs
1us2us3us4rs
my_worksheet/tablegs
filename/databasefs
eot
You could probably leave off the gs and fs characters if your file holds data for a single, unspecified table. It would still be nice to include the eot to give parsers a notification that they are done.
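Here is a rough Python sketch of that round trip, following the layout in the example above. Where exactly the table and file labels sit is my reading of the example rather than a settled spec, and the labels themselves ("my_worksheet/table", "filename/database") are just placeholders.

US, RS, GS, FS, EOT = chr(31), chr(30), chr(29), chr(28), chr(4)
rows = [
    ["col1", "col2", "col3", "col4"],
    ["A", "B", "C", "D"],
    ["1", "2", "3", "4"],
]
# Writing: fields joined by US, each record terminated by RS,
# then the table label (terminated by GS), the file label (FS), and EOT.
payload = (
    "".join(US.join(fields) + RS for fields in rows)
    + "my_worksheet/table" + GS
    + "filename/database" + FS
    + EOT
)
# Reading: everything before GS is one table; the text after the
# last RS is the table label, everything before it is the records.
group, _, _ = payload.partition(GS)
*records, table_name = group.split(RS)
parsed = [record.split(US) for record in records]
print(table_name)  # my_worksheet/table
print(parsed)      # [['col1', 'col2', 'col3', 'col4'], ['A', 'B', 'C', 'D'], ['1', '2', '3', '4']]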
Header rows
Often, column names are listed as the first row of a .csv file. The same could be done with a .ccsv file where you indicate that the first row contains titles. There would not need to be any modification to the .ccsv format.
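As a sketch of how a reader might handle that convention, peeling off the first record as the column names is all it takes (parse_with_headers is a hypothetical helper, not part of any spec):

US, RS = chr(31), chr(30)

def parse_with_headers(data: str) -> list[dict[str, str]]:
    # Split records on RS and fields on US; the first record supplies the keys.
    records = [rec.split(US) for rec in data.split(RS) if rec]
    headers, *rows = records
    return [dict(zip(headers, row)) for row in rows]

data = US.join(["col1", "col2", "col3", "col4"]) + RS + US.join(["A", "B", "C", "D"]) + RS
print(parse_with_headers(data))  # [{'col1': 'A', 'col2': 'B', 'col3': 'C', 'col4': 'D'}]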
Editors
So this brings us to how our tooling should be set up. Since all of the separator characters are non-printable, they are generally invisible in basic editor programs.
I think that editors should probably implement two triggers: one on read and one on save.
When a text editor reads a .ccsv file, it should globally replace the separators with some visible glyph as I have done here in the example. Some carriage returns would also be helpful.
When a text editor saves a .ccsv file, it should do the opposite and put in the separator characters where appropriate.
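A minimal sketch of those two triggers in Python, using the same "us"/"rs" glyphs as the example above (the glyph choices are arbitrary, and a real editor would need to guard against data that already contains those strings):

US, RS = chr(31), chr(30)

def on_read(raw: str) -> str:
    # Make the separators visible and add a line break after each record.
    return raw.replace(US, "us").replace(RS, "rs\n")

def on_save(display: str) -> str:
    # Reverse the substitution before writing the file back out.
    return display.replace("rs\n", RS).replace("us", US)

raw = US.join(["A", "B", "C", "D"]) + RS + US.join(["1", "2", "3", "4"]) + RS
assert on_save(on_read(raw)) == raw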
Cut-and-paste behavior would also need to be addressed.
Visual editors would have to parse the data based on their needs, but removing any type of ambiguity about what goes in a row and what goes in a column should help.
What would this do for us?
Most programming languages already have some sort of char() function and some sort of replace() function, making this type of file pretty easy to generate and work with.
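For example, in Python, converting a comma-delimited line into this format is a single replace() call (assuming the line has no embedded or escaped commas, which is exactly the ambiguity the control characters are meant to remove):

US = chr(31)  # Unit Separator

csv_line = "col1,col2,col3,col4"
ccsv_line = csv_line.replace(",", US)   # generate
fields = ccsv_line.split(US)            # work with
print(fields)  # ['col1', 'col2', 'col3', 'col4']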
Spreadsheet programs like Excel and Numbers could be set up to automatically detect .ccsv files and handle them appropriately without the need to specify delimiters.
What about multiple tables in one file?
Certainly possible with the table and database separators, but we can leave details about that for later.
What next
I’d love to see a VS Code extension to allow editing .ccsv files. Maybe I’ll try to convince somebody to build one.
UPDATE: 11/06/2019
After looking around the web a bit, it looks like trying to use .data as a file extension will cause some problems. That extension is a variant of a .dat file which is used as a generic data format by a bunch of different programs.
Instead, I’m thinking about using .ccsv (control character separated values) as a file extension. I couldn’t find anything else using that, and I like that it suggests an alternative to a .csv file, since the two serve similar functions.
I’ll stick with that for now unless it has some other sort of problem.