TSV - Tab Separated Values: My Favorite Format
Posted: | More posts about pythonTSV (Tab-Separated Values) is a simple text format for tabular data. I'll describe how I work with TSV files on the command line and why TSV is my favorite format for tabular data.
What is TSV?
"TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas – literal commas are very common in text data, but literal tab stops are infrequent in running text." - Wikipedia.
The IANA definition describes the actual file format:
A tsv file encodes a number of records that may contain multiple fields. Each record is represented as a single line. Each field value is represented as text. Fields in a record are separated from each other by a tab character.
Note that fields that contain tabs are not allowable in this encoding.
How does a TSV file look like? We can create a TSV file with 4 records on the command line:
$ echo -e "Name\tAge\tAddress" > records.tsv $ echo -e "Paul\t23\tBerlin\nEmily\t32\tCopenhagen\nAliyah\t27\tHamburg" >> records.tsv $ echo -e "John Doe\t42\tUniverse" >> records.tsv $ cat records.tsv Name Age Address Paul 23 Berlin Emily 32 Copenhagen Aliyah 27 Hamburg John Doe 42 Universe
The last record (name longer than 7 characters, default tab stop is 8 characters) makes the table look ragged, i.e. columns are not align. To print a TSV file with columns aligned, we can use the "column" command (installed on Debian/Ubuntu by default):
$ column -t -s$'\t' records.tsv Name Age Address Paul 23 Berlin Emily 32 Copenhagen Aliyah 27 Hamburg John Doe 42 Universe
The -t
option tells column
that the input file is already in tabular format. The -s
option sets the separator to TAB.
The string $'\t'
is a Bash construct to expand backslash-escaped characters (\t
for TAB) as specified by the ANSI C standard.
UNIX Power Tools for TSV
Standard UNIX tools can be leveraged to work with TSV files.
The cut
command is handy to extract certain columns.
cut
uses TAB as the default delimiter:
We can sort records by a given column, e.g. sort by age (second column):
$ sort -n -t$'\t' -k2 records.tsv | column -t -s$'\t' Name Age Address Paul 23 Berlin Aliyah 27 Hamburg Emily 32 Copenhagen John Doe 42 Universe
GNU AWK is pretty powerful and by setting the delimiter to TAB (-F$'\t'
), we can use it to calculate the average age:
Given a JSON file, we can use jq to convert JSON arrays to TSV via @tsv
:
$ echo '[{"id":1,"name":"a"},{"id":2,"name":"b"}]' > data.json $ cat data.json | jq '.[]|[.id, .name]|@tsv' -r
My favorite text editor is Vim. Vim can be used to edit TSV files, but we should make sure to enable "list mode" via :set list
to show TAB characters:
There are command line tools specialized for CSV/TSV data, e.g. eBay has a nice collection of Open Source TSV utilities.
Spreadsheet Import & Export
Being tabular data, .tsv
files can be directly opened with LibreOffice Calc:
We can also directly copy & paste TSV data into a spreadsheet, so TSV is "clipboard friendly".
Google Sheets allows to download sheets ("tabs") as .tsv
files:
Reading & Writing TSV with Python
Python's standard library contains the csv
module which can read and write the TSV format.
Reading TSV records as dictionaries with csv.DictReader
:
import csv with open("records.tsv") as fd: reader = csv.DictReader(fd, dialect="excel-tab") for row in reader: print(f"{row['Name']} is {row['Age']} years old!")
Writing a TSV file works similarly with csv.DictWriter
:
import csv field_names = ["Name", "Age", "Address"] people = [ {"Name": "John Doe", "Age": 42}, {"Name": "Lekisha Rhymer", "Age": 30} ] with open("out.tsv", "w") as fd: writer = csv.DictWriter(fd, dialect="excel-tab", fieldnames=field_names) writer.writeheader() for person in people: writer.writerow(person)
Summary
When working with tabular data (10-100k rows), tab-separated values (TSV) is my favorite data format. TSV is widely supported and allows quick iteration and experimentation with standard UNIX tools. CSV is often problematic as data tends to contain commas (e.g. in "Lastname, Firstname"), but tab stops in data can usually be avoided.
My Open Source projects provide export to TSV out of the box for convenience:
Kubernetes Resource Report allows downloading table data as TSV (e.g. list of applications, Pods), e.g. for processing and charting cost data in spreadsheets
Kubernetes Web View provides TSV export of listed Kubernetes resources, e.g. for further analysis of cluster objects