Sampling a text file with a header

Post date: Mar 10, 2015 1:24:14 AM

This method works, but it may be inefficient for a large file, because it writes a temporary copy of everything except the header to disk:

## Randomly select 1M rows from a big file and keep the header.
# Write the header line to the output file
head -n 1 input.csv > downsampled.csv
# Copy everything except the header to a temp file
tail -n +2 input.csv > tmp.csv
# Sample 1M rows from the temp file and append them below the header
shuf -n 1000000 tmp.csv >> downsampled.csv
# Remove the temp file
rm tmp.csv
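
One possible way to skip the temporary file is to pipe the output of tail straight into shuf. The sketch below wraps that idea in a function; the name sample_with_header and its three arguments (input file, output file, number of rows) are made up for illustration and are not part of the original script. It assumes GNU coreutils (head, tail, shuf) is installed.

```shell
# Hypothetical helper (name and arguments are assumptions):
#   $1 = input file, $2 = output file, $3 = number of rows to sample
sample_with_header() {
    # Write the header line to the output file
    head -n 1 "$1" > "$2"
    # Stream the rows after the header into shuf and append the sample
    tail -n +2 "$1" | shuf -n "$3" >> "$2"
}
```

For the file names used above, the call would be: sample_with_header input.csv downsampled.csv 1000000. If the file has fewer data rows than requested, shuf simply returns all of them in random order.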

Further resources:

http://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash

http://stackoverflow.com/questions/339483/how-can-i-remove-the-first-line-of-a-text-file-using-bash-sed-script