Sampling a text file with a header

Post date: Mar 10, 2015 1:24:14 AM

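This method works, but it may be inefficient for a large file because it writes a full copy of the data (everything except the header) to a temporary file:

## This script randomly selects 1M rows from a big file and writes them out together with the header.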

# Get the header

head -n 1 input.csv > downsampled.csv

# remove the header

tail -n +2 input.csv > tmp.csv

# downsample the contents and append them below the header

shuf -n 1000000 tmp.csv >> downsampled.csv

# remove the temp file

rm tmp.csv
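
The same steps can be combined into a single pipeline so that no temporary file is written. This is only a minimal sketch, assuming GNU coreutils (for shuf) and bash; the function name sample_with_header and its argument order are made up for illustration and are not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and argument order are assumptions):
#   sample_with_header INPUT N OUTPUT
# Writes the first line of INPUT (the header) followed by up to N
# randomly chosen data rows to OUTPUT, without a temporary file.
sample_with_header() {
    local input="$1" n="$2" output="$3"
    {
        # Get the header
        head -n 1 "$input"
        # Skip the header and downsample the remaining rows
        tail -n +2 "$input" | shuf -n "$n"
    } > "$output"
}

# Example call, matching the file names used above:
#   sample_with_header input.csv 1000000 downsampled.csv
```

The braces group both commands so their output goes to the same file in order: header first, then the sampled rows. If the input has fewer than N data rows, shuf simply returns all of them in random order.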

Further resources:

http://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash

http://stackoverflow.com/questions/339483/how-can-i-remove-the-first-line-of-a-text-file-using-bash-sed-script
