Teresita M. Porter - Work with compressed sequence files

15 - Working with compressed files saves disk space

Bioinformatic processing of high-throughput sequencing data takes batches of large files as input and creates large batches of output files at nearly every processing step. That can quickly consume lots of disk space!

Step 1 - hundreds of compressed sequence files

Step 2 - pair reads, creating hundreds of paired read files

Step 3 - trim adapters/primers, creating hundreds more read files, etc.

Work with the compressed *.fastq.gz files whenever you can. Many programs accept them as input and can write compressed output files as well. I do this routinely with SEQPREP for pairing R1 and R2 files, as well as with CUTADAPT when trimming primers.
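
When a program cannot read or write gzip files itself, you can often still avoid uncompressed intermediate files by streaming through pipes. This is only a minimal sketch, assuming the program reads standard input and writes standard output; awk stands in for such a program here, and the file names are made up for illustration:

```shell
# Build a tiny two-record FASTQ file so the example runs on its own
# (input.fastq.gz and output.fastq.gz are hypothetical names).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > input.fastq.gz

# Decompress on the fly, keep only the first record (4 lines) as a
# stand-in for a real processing step, and recompress the result
# without ever writing an uncompressed file to disk.
zcat input.fastq.gz | awk 'NR <= 4' | gzip > output.fastq.gz

# Confirm what was written by counting lines in the compressed output.
zcat output.fastq.gz | wc -l
```

On some systems (macOS, for example) zcat expects .Z files; gzip -dc or gunzip -c can be used in its place.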

To peek into a compressed file, use zcat:

$ zcat file.fastq.gz | head

$ zcat file.fastq.gz | tail

To count lines in a compressed file:

$ zcat file.fastq.gz | wc -l
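
The line count can be turned into a read count. A rough sketch, assuming the usual FASTQ layout of exactly 4 lines per record (files with wrapped sequence lines would break this assumption); example.fastq.gz is a made-up file name:

```shell
# Build a tiny two-record FASTQ file so the example runs on its own
# (example.fastq.gz is a hypothetical name; substitute your own file).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > example.fastq.gz

# Count lines in the compressed file, then divide by 4 because each
# FASTQ record is assumed to span exactly 4 lines.
lines=$(zcat example.fastq.gz | wc -l)
reads=$((lines / 4))
echo "$reads"
```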

To decompress to another file (if you can't use gunzip for some reason):

$ zcat file.fastq.gz > file.fastq
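
One practical difference is worth noting: redirecting zcat output leaves the compressed file in place, whereas plain gunzip replaces it with the uncompressed file. The sketch below uses gunzip -c, which should behave the same way as zcat; the file names are made up for illustration:

```shell
# Build a tiny one-record FASTQ file so the example runs on its own
# (example.fastq.gz is a hypothetical name).
printf '@read1\nACGT\n+\nIIII\n' | gzip > example.fastq.gz

# -c writes the decompressed data to standard output, so the
# original example.fastq.gz is left untouched.
gunzip -c example.fastq.gz > example.fastq

# Both the compressed and the uncompressed copies now exist.
ls example.fastq.gz example.fastq
```

Remember to delete or recompress the uncompressed copy when you are done with it, otherwise the disk space savings are lost.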