Smart batch processing using make
A common situation in data analysis is having a bunch of files that need to be processed - for example, fasta files of DNA sequences that need to be aligned. Each fasta file needs to be input to a program that outputs an alignment, e.g.:
mafft gene001.fasta > gene001.aln
One solution is to use a for loop in the shell with wildcard expansion:
for infile in *.fasta ; do
    mafft "$infile" > "$(basename "$infile" .fasta).aln"
done
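A quicker way is to process the files in parallel, here using GNU parallel with 4 jobs at a time. The command is quoted so that the redirection is carried out by parallel for each file, rather than once by the calling shell:
parallel -j 4 'mafft {} > {.}.aln' ::: *.fasta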
However, what happens if, after you do this, some new fasta files are added, or some are changed, and need to be realigned? Running either of the commands above would realign all the fasta files, including those that don't need to be realigned.
It would be nice if there were a way to align just the files that either don't have a corresponding alignment, or whose alignment is out of date (older than the fasta file).
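To make the goal concrete, the check could be written by hand in the shell. This is only a rough sketch: needs_update and align_stale are hypothetical helper names, and it assumes a shell whose test builtin supports the -nt ("newer than") operator, as bash, dash and ksh do.

```shell
# Hypothetical helper: succeeds if the target is missing or older than the source.
needs_update() {
    src=$1
    target=$2
    [ ! -e "$target" ] || [ "$src" -nt "$target" ]
}

# Hypothetical helper: align only the fasta files in the current directory
# whose alignment is missing or out of date.
align_stale() {
    for infile in *.fasta; do
        [ -e "$infile" ] || continue      # the glob matched nothing
        aln="${infile%.fasta}.aln"        # gene001.fasta -> gene001.aln
        if needs_update "$infile" "$aln"; then
            mafft "$infile" > "$aln"
        fi
    done
}
```

Calling align_stale in the directory with the fasta files would do the job, but this is bookkeeping that has to be written and maintained for every new task.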
make to the rescue
make is well known as a tool for compiling source code, but it can be used here, too. All that is needed is a simple Makefile with the following five lines:
%.aln: %.fasta
	mafft $< > $@
fastas := $(wildcard *.fasta)
alns := $(fastas:.fasta=.aln)
all: $(alns)
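Here, lines 1-2 form a pattern rule: to create a target file (.aln extension) from a source file (.fasta extension), run mafft, with the source file indicated by $< and the target by $@. Note that the command on line 2 must be indented with a tab character, not spaces. Line 3 assigns the names of all files with a .fasta extension to the variable fastas. Line 4 derives the corresponding alignment file names and assigns them to the variable alns. Line 5 makes these alignment files prerequisites of the target all. Because all is the first target that is not a pattern rule, it is the default goal - meaning that creating these files from their sources is the objective when make is run without arguments.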
Now, with this Makefile in the same directory as the fasta files, simply running make will align only the fasta files that need to be aligned - i.e. those whose corresponding alignment is missing or older than the fasta file. Even better, you get parallelization for free - e.g. make -j 4 will run up to 4 alignments at once.
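One way to check this behaviour before committing to a long run is make's -n option, which prints the commands that would be executed without running them. The sketch below sets up a scratch directory with empty placeholder files. The name gene002.fasta and the timestamps are made up for illustration, and mafft does not need to be installed because nothing is actually executed.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the Makefile from the text; printf writes the tab make requires before the recipe.
printf '%s\n' '%.aln: %.fasta' > Makefile
printf '\t%s\n' 'mafft $< > $@' >> Makefile
printf '%s\n' 'fastas := $(wildcard *.fasta)' 'alns := $(fastas:.fasta=.aln)' 'all: $(alns)' >> Makefile

# Placeholder inputs: gene001 already has a newer alignment, gene002 has none.
touch -t 202001010000 gene001.fasta gene002.fasta
touch -t 202101010000 gene001.aln

# Dry run: list what make would do, without running mafft.
plan=$(make -n)
printf '%s\n' "$plan"
# prints: mafft gene002.fasta > gene002.aln
```

Only gene002 is listed, since gene001.aln is newer than gene001.fasta. Running touch gene001.fasta followed by make -n again would list both files.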