Basic Data Analysis at the Unix shell
I often prefer the shell and Unix utilities to waiting for data to load into a relational database or MS Access. There are plenty of cases when an RDBMS is a better choice, especially when what you’re doing requires joins. At the shell it’s often possible to analyze the files without even transforming their encoding first. I have developed a couple of recipes for doing some SQL equivalents at the shell. These are a few that I just used, so they’re fresh in my mind. Most of the time all it takes is a bit of imagination about how to create a simple data flow by learning and composing a small handful of the ubiquitous Unix utilities.
All these examples will also work within the Cygwin environment for Windows, or at a Terminal in OS X (especially when combined with the additional software available via the Fink or MacPorts projects).
Counting Records
"SELECT COUNT(*) FROM TABLE"
Just selecting the count of records from an input file is one of the easiest things to accomplish (if your file is already line-oriented). The wc, or word count, utility can do this easily. By default it reports counts of lines, words, and bytes. With ‘-l’ it will emit only the count of lines.
user@host:~/data$ wc -l table.tab
10 table.tab
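wc also accepts several files at once and appends a total line, which is handy when a table has been split across multiple extracts (a quick sketch; the file names here are hypothetical):
user@host:~/data$ wc -l table1.tab table2.tab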
If you want to ignore the header, start with the second line (see the next example for a more thorough explanation):
user@host:~/data$ tail -n +2 table.tab | wc -l
9
Counting Distinct Values from a Column
"SELECT COUNT(DISTINCT(FIELD1)) FROM TABLE"
For getting a distinct count of values in a column:
user@host:~/data$ cut -f1 table.tab | tail -n +2 | sort | uniq -c
This operation starts with the cut utility. cut allows you to take particular columns from a tab-delimited file, or character ranges from a fixed-width file. cut also lets you specify the delimiter with ‘-d’, but be warned that the commonly encountered CSV format is more complex than cut can handle (CSV can support embedded delimiters and quote characters, which are beyond the scope of what cut attempts to handle). Here, ‘-f1’ takes the first column out of the input file.
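For instance, /etc/passwd is colon-delimited, so this pulls out just the user names (a quick illustration unrelated to table.tab):
user@host:~/data$ cut -d: -f1 /etc/passwd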
The next part is the tail command. tail outputs the end, or ‘tail’, of a file. The ‘-n’ option normally tells tail how many lines to emit, counted from the end of the file. Prefixing the number with a ‘+’ changes the meaning: ‘+2’ tells tail to start output at the second line from the beginning instead of counting back from the end. This effectively discards the header line.
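To make the difference concrete (sketches against the same hypothetical table.tab), the first command prints the last five lines, the second prints everything from line two onward, and the third uses head, tail’s counterpart, to print just the header row:
user@host:~/data$ tail -n 5 table.tab
user@host:~/data$ tail -n +2 table.tab
user@host:~/data$ head -n 1 table.tab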
Then the values from the first column themselves are sorted. This is necessary for the uniq command, which will only collapse or count duplicate lines when they are adjacent.
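A quick way to see why the sort matters (a throwaway example, not part of the table.tab pipeline):
user@host:~/data$ printf 'a\nb\na\n' | uniq
a
b
a
user@host:~/data$ printf 'a\nb\na\n' | sort | uniq
a
b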
Finally we reduce duplicate lines with uniq. The ‘-c’ tells uniq to emit the count of duplicates when collapsing them. Strictly speaking, this gives a count for each distinct value, which is closer to a GROUP BY than to the single number COUNT(DISTINCT(FIELD1)) returns; both are easy to get, as shown below.
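Two variations I use often (sketches against the same hypothetical table.tab): the first sorts the tallies so the most frequent values come first, roughly "GROUP BY field1 ORDER BY COUNT(*) DESC"; the second collapses the column to a single count of distinct values, which matches the SQL at the top of this example.
user@host:~/data$ cut -f1 table.tab | tail -n +2 | sort | uniq -c | sort -rn | head
user@host:~/data$ cut -f1 table.tab | tail -n +2 | sort -u | wc -l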
Dealing with various file archive types
I often work with files in zip archives and tar (Unix tape archive) archives, sometimes with additional compression applied to them (.Z, Unix compress; .gz, gzip; and .bz2, bzip2). It is possible to work with these files without having to unarchive or decompress them permanently if all you need is a simple count of lines or to process them only once.
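As a sketch (the file names here are hypothetical), each of those formats can be streamed straight into a pipeline: zcat and bzcat ship with gzip and bzip2 (gzip -dc and bzip2 -dc are equivalent if your system’s zcat only understands .Z files), and GNU tar’s ‘-O’ option writes an archive member to standard output.
user@host:~/data$ zcat table.tab.gz | wc -l
user@host:~/data$ bzcat table.tab.bz2 | wc -l
user@host:~/data$ tar -xzOf archive.tar.gz table.tab | wc -l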
Pulling a file from a Zip Archive
To pull one or more files from within a zip archive, and send them to another command (as part of a pipeline):
user@host:~/data$ unzip -l archive.zip
Archive:  archive.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
       34  04-28-08 11:01   table1.tab
       56  04-28-08 11:01   table2.tab
user@host:~/data$ unzip -c archive.zip table1.tab table2.tab | wc -l
36
That example uses the unzip command to pull two files out and send them to standard output (the -c option to unzip instructs it to print the files to standard out rather than extract them to the file system). Those two files were then sent to wc to get the combined line (record) count for the two files. We don’t have to worry about cleaning up the two files after the command has completed since they were never written to disk. Note that -c also prints the archive name and each file name as it goes; if those extra lines would skew a count, the -p option emits only the raw file data.
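The same trick composes with the earlier recipes; for instance (a sketch against the same hypothetical archive), tallying the distinct values in the first column of one member without ever extracting it:
user@host:~/data$ unzip -p archive.zip table1.tab | cut -f1 | tail -n +2 | sort | uniq -c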