7 - Tricks for processing massive fasta files for BLAST searches.
With the increasing use of next-generation sequencing comes the problem of how to efficiently process massive query files for BLAST searches.
Strategies for conducting more efficient BLAST searches:
1. Reduce the number of reads you have to BLAST in the first place.
There are a number of ways you can reduce your dataset: filtering out short reads, longer-than-expected reads (454 pyrosequencing), poor-quality reads, and contaminant reads (e.g. by mapping to human, cow, etc.); filtering for a taxon of interest (e.g. by mapping to a reference genome); removing duplicate reads (dereplication); and clustering amplicon reads into operational taxonomic units (OTUs) if applicable.
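As a minimal sketch of two of these reductions (dropping short reads and exact-match dereplication), the awk filter below works on a FASTA file with one-line sequences. The directory filter_demo, the file names, the demo reads, and the 5-base threshold are all hypothetical; pick a threshold that suits your own data. It keeps only the first copy of each identical sequence and does not handle quality filtering or OTU clustering.

```shell
# Hypothetical demo input: header lines followed by one-line sequences.
mkdir -p filter_demo
printf '>r1\nACGTACGTAC\n>r2\nACG\n>r3\nACGTACGTAC\n>r4\nGGGGCCCCTTTT\n' > filter_demo/reads.fasta

# Assumed threshold: keep reads of at least min_len bases.
min_len=5

awk -v min="$min_len" '
    /^>/ { header = $0; next }            # remember the header line
    length($0) >= min && !seen[$0]++ {    # long enough and not seen before
        print header
        print $0
    }
' filter_demo/reads.fasta > filter_demo/reads.filtered.fasta

cat filter_demo/reads.filtered.fasta
```

Here r2 is dropped for being too short and r3 for duplicating r1, so only r1 and r4 remain to be BLASTed.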
2. Split large jobs into smaller jobs.
BLAST tends to process small query files more quickly than large ones, so splitting a large job into x smaller jobs may decrease the run time, sometimes by more than a factor of x.
One of my favorite unix commands is 'split'.
Example: start with a FASTA file of query sequences for a BLAST search in which each entry is a header line followed by a one-line sequence, with no empty lines between entries.
$split -l 1000 BigFile.fasta
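The output files are named automatically: xaa, xab, and so on up to xzz. The -l option sets the number of lines per file, here 1000 lines, or in our case 500 FASTA-formatted sequences. Keep the line count even so that a header is never separated from its sequence. With the default two-letter suffix, split can create at most 676 files; for larger inputs, lengthen the suffix with -a.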
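The split command above relies on every record being exactly two lines. If your sequences are wrapped over several lines, one possible approach is to join them first. The sketch below does this with awk and then splits the result; the directory split_demo, the file names, and the tiny chunk size of 4 lines are hypothetical and chosen only to keep the demo small.

```shell
mkdir -p split_demo
# Hypothetical wrapped input: the sequence for r1 spans two lines.
printf '>r1\nACGT\nACGT\n>r2\nGGCC\n>r3\nTTAA\n' > split_demo/wrapped.fasta

# Join wrapped sequence lines so every record is exactly two lines.
awk '
    /^>/ { if (seq != "") print seq; print; seq = ""; next }   # flush previous sequence, print header
    { seq = seq $0 }                                           # append a sequence line
    END { if (seq != "") print seq }                           # flush the last sequence
' split_demo/wrapped.fasta > split_demo/oneline.fasta

# Split into files of 2 records (4 lines) each, using the prefix split_demo/x.
split -l 4 split_demo/oneline.fasta split_demo/x

ls split_demo
```

This produces split_demo/xaa with two records and split_demo/xab with the remaining one.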
3. Run many small jobs in parallel.
One of my favorite GNU tools is 'parallel' [GNU parallel]. If you are working on a multi-core machine, you can very easily distribute numerous jobs of the same type.
$ls | grep '^x' | parallel -j 48 "blastn -task megablast -db /path/to/nt -query {} -out {}.blastn -evalue '1e-10' -outfmt 0 -num_descriptions 100 -num_alignments 100"
The output files are named automatically, from xaa.blastn to xzz.blastn. In this example, I've used -j 48 to run 48 jobs at once, one per core, but you should tailor this to the number of cores available on the machine you are working on.
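Two small details can be sketched here. First, the pattern '^x' would also match the reports (xaa.blastn, etc.) if the command is rerun in the same directory, so a stricter pattern that matches only the chunk names may be safer. Second, the job count can be derived from the machine instead of being typed in. The directory parallel_demo and its files are hypothetical, and leaving one core free is an assumption on my part, not a requirement of parallel.

```shell
mkdir -p parallel_demo
# Hypothetical directory state after a first run: two chunks plus one report.
touch parallel_demo/xaa parallel_demo/xab parallel_demo/xaa.blastn

# Anchoring both ends of the pattern keeps only the two-letter chunk names.
ls parallel_demo | grep -E '^x[a-z]{2}$' > parallel_demo/chunks.txt
cat parallel_demo/chunks.txt

# Number of cores online; nproc is GNU coreutils, getconf is a fallback.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
# Assumed policy: leave one core free, but never go below one job.
jobs=$(( cores > 1 ? cores - 1 : 1 ))
echo "suggested option: parallel -j $jobs"
```

The list in chunks.txt could then be piped to parallel in place of the ls and grep step.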
The partial blast outfiles can then be quickly concatenated back into one file if necessary.
$ls | grep blastn | parallel -j 1 "cat {} >> concatenated.file"
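Before concatenating, it can be worth checking that every chunk actually produced a report, since a failed job may leave a missing or empty file. A minimal sketch follows, with a hypothetical directory concat_demo and made-up report contents; it also uses a plain shell glob for the concatenation, which expands in sorted order.

```shell
mkdir -p concat_demo
# Hypothetical chunks and reports; xab.blastn is empty, as after a failed job.
touch concat_demo/xaa concat_demo/xab
printf 'report A\n' > concat_demo/xaa.blastn
: > concat_demo/xab.blastn

# List chunks whose report is missing or empty.
for chunk in concat_demo/x??; do
    [ -s "$chunk.blastn" ] || echo "$chunk"
done > concat_demo/incomplete.txt
cat concat_demo/incomplete.txt

# The glob expands in sorted order, so reports are joined in chunk order.
cat concat_demo/x??.blastn > concat_demo/concatenated.file
```

Any chunk listed in incomplete.txt would need to be rerun before the concatenated file can be trusted.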
4. BLAST against a local database, preferably on the scratch disk of your multi-core machine.
If you work on a cluster like I do, intensive BLAST searches can be slowed down tremendously by network traffic. Working directly from the scratch disk helps to get around this problem. I put a copy of the database I'm searching against on the scratch disk, along with all my infiles and outfiles. I pretty much work off the scratch disk all the time these days. Needless to say, I'm using a downloaded copy of BLAST+ to run local searches [BLAST+ executables]. Also, pick your reference database carefully. Do you really need to BLAST your fungal ITS sequences against the whole NCBI nucleotide database or would the curated fungal database at UNITE work just as well? [UNITE downloads] Remember, the bigger the reference database you query, the longer the search will take.
5. Use MEGAN4 to quickly review the taxonomic assignments of your BLAST searches.
My favorite way to get a quick overview of the taxonomic assignments from BLAST searches is to import the results into MEGAN4 (Huson et al., 2011). If you search the NCBI nucleotide database using blastn, then MEGAN can be used to easily summarize taxonomic profiles to variable taxonomic ranks. If you search against a RefSeq database, then MEGAN also provides KEGG and SEED maps for your results. MEGAN not only parses and summarizes BLAST reports, but can also read files from the Ribosomal Database Project Naive Bayesian Classifier for 16S or fungal LSU sequences.
6. Don't forget to archive your BLAST reports once you're done with them.
BLAST reports can take up a huge amount of space. Use gzip, for example (here through tar's -z option), to compress them.
$tar -czvf smallFile.blastn.tar.gz bigFile.blastn
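If you would rather compress a single report in place and only delete the original once the compressed copy has been verified, something like the following sketch could be used. The directory archive_demo and the report contents are hypothetical.

```shell
mkdir -p archive_demo
# Hypothetical report standing in for a large BLAST outfile.
printf 'BLASTN report\n' > archive_demo/bigFile.blastn

# Write the compressed copy next to the original.
gzip -c archive_demo/bigFile.blastn > archive_demo/bigFile.blastn.gz

# Test the compressed file's integrity; remove the original only if it passes.
gzip -t archive_demo/bigFile.blastn.gz && rm archive_demo/bigFile.blastn

ls archive_demo
```

The report can later be read without unpacking it to disk using gzip -dc.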
References
Abarenkov K, Nilsson RH, Larsson K-H, et al. (2010) The UNITE database for molecular identification of fungi - recent updates and future perspectives. New Phytologist, 186: 281-285.
Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN4. Genome Research, 21: 1552-1560.
Liu K-L, Porras-Alfaro A, Kuske CR, Eichorst SA, Xie G (2012) Accurate, rapid taxonomic classification of fungal large-subunit rRNA genes. Applied and Environmental Microbiology, 78: 1523-1533.
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73: 5261-5267.