TSV - Tab Separated Values: My Favorite Format

Posted: 2020-07-12 09:09 | More posts about python

TSV (Tab-Separated Values) is a simple text format for tabular data. I'll describe how I work with TSV files on the command line and why TSV is my favorite format for tabular data.

What is TSV?

"TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas – literal commas are very common in text data, but literal tab stops are infrequent in running text." - Wikipedia.

The IANA definition describes the actual file format:

A tsv file encodes a number of records that may contain multiple fields. Each record is represented as a single line. Each field value is represented as text. Fields in a record are separated from each other by a tab character.

Note that fields that contain tabs are not allowable in this encoding.

How does a TSV file look like? We can create a TSV file with 4 records on the command line:

$ echo -e "Name\tAge\tAddress" > records.tsv
$ echo -e "Paul\t23\tBerlin\nEmily\t32\tCopenhagen\nAliyah\t27\tHamburg" >> records.tsv
$ echo -e "John Doe\t42\tUniverse" >> records.tsv
$ cat records.tsv
Name    Age     Address
Paul    23      Berlin
Emily   32      Copenhagen
Aliyah  27      Hamburg
John Doe        42      Universe

The last record (name longer than 7 characters, default tab stop is 8 characters) makes the table look ragged, i.e. columns are not align. To print a TSV file with columns aligned, we can use the "column" command (installed on Debian/Ubuntu by default):

$ column -t -s$'\t' records.tsv
Name      Age  Address
Paul      23   Berlin
Emily     32   Copenhagen
Aliyah    27   Hamburg
John Doe  42   Universe

The -t option tells column that the input file is already in tabular format. The -s option sets the separator to TAB. The string $'\t' is a Bash construct to expand backslash-escaped characters (\t for TAB) as specified by the ANSI C standard.

UNIX Power Tools for TSV

Standard UNIX tools can be leveraged to work with TSV files. The cut command is handy to extract certain columns. cut uses TAB as the default delimiter:

$ cut -f 3 records.tsv
Address
Berlin
Copenhagen
Hamburg
Universe

We can sort records by a given column, e.g. sort by age (second column):

$ sort -n -t$'\t' -k2 records.tsv | column -t -s$'\t'
Name      Age  Address
Paul      23   Berlin
Aliyah    27   Hamburg
Emily     32   Copenhagen
John Doe  42   Universe

GNU AWK is pretty powerful and by setting the delimiter to TAB (-F$'\t'), we can use it to calculate the average age:

$ awk -F$'\t' '{ sum += $2 }; END { print sum / (NR-1) }' records.tsv
31

Given a JSON file, we can use jq to convert JSON arrays to TSV via @tsv:

$ echo '[{"id":1,"name":"a"},{"id":2,"name":"b"}]' > data.json
$ cat data.json | jq '.[]|[.id, .name]|@tsv' -r

My favorite text editor is Vim. Vim can be used to edit TSV files, but we should make sure to enable "list mode" via :set list to show TAB characters:

There are command line tools specialized for CSV/TSV data, e.g. eBay has a nice collection of Open Source TSV utilities.

Spreadsheet Import & Export

Being tabular data, .tsv files can be directly opened with LibreOffice Calc:

We can also directly copy & paste TSV data into a spreadsheet, so TSV is "clipboard friendly".

Google Sheets allows to download sheets ("tabs") as .tsv files:

Reading & Writing TSV with Python

Python's standard library contains the csv module which can read and write the TSV format. Reading TSV records as dictionaries with csv.DictReader:

import csv

with open("records.tsv") as fd:
    reader = csv.DictReader(fd, dialect="excel-tab")
    for row in reader:
        print(f"{row['Name']} is {row['Age']} years old!")

Writing a TSV file works similarly with csv.DictWriter:

import csv

field_names = ["Name", "Age", "Address"]
people = [
    {"Name": "John Doe", "Age": 42},
    {"Name": "Lekisha Rhymer", "Age": 30}
]

with open("out.tsv", "w") as fd:
    writer = csv.DictWriter(fd, dialect="excel-tab", fieldnames=field_names)
    writer.writeheader()
    for person in people:
        writer.writerow(person)

Summary

When working with tabular data (10-100k rows), tab-separated values (TSV) is my favorite data format. TSV is widely supported and allows quick iteration and experimentation with standard UNIX tools. CSV is often problematic as data tends to contain commas (e.g. in "Lastname, Firstname"), but tab stops in data can usually be avoided.

My Open Source projects provide export to TSV out of the box for convenience:

Kubernetes Resource Report allows downloading table data as TSV (e.g. list of applications, Pods), e.g. for processing and charting cost data in spreadsheets
Kubernetes Web View provides TSV export of listed Kubernetes resources, e.g. for further analysis of cluster objects

Kubernetes Resource Report: download as TSV