msa_compactor
Overview
msa_compactor is a simple command-line tool that "shrinks" multiple sequence alignments (MSAs) in length by removing the sequences that induce the most gaps.
msa_compactor is implemented in Java as part of the forester libraries.
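To make the idea of a "gap-inducing" sequence concrete, here is a minimal Python sketch. It is an illustration only and not the forester implementation: the scoring rule (for each column where a sequence has a residue, add the fraction of the other sequences that have a gap there) and the function names gap_inducing_score, drop_gap_only_columns and remove_worst_offender are assumptions made for this example.

```python
GAP = "-"

def gap_inducing_score(msa, index):
    """Hypothetical score (an assumption, not forester's formula): for every
    column where sequence `index` has a residue, add the fraction of the
    other sequences that have a gap in that column."""
    seq = msa[index]
    others = [s for i, s in enumerate(msa) if i != index]  # needs >= 2 sequences
    score = 0.0
    for col, ch in enumerate(seq):
        if ch != GAP:
            score += sum(o[col] == GAP for o in others) / len(others)
    return score

def drop_gap_only_columns(msa):
    """Delete columns that contain nothing but gaps."""
    keep = [c for c in range(len(msa[0])) if any(s[c] != GAP for s in msa)]
    return ["".join(s[c] for c in keep) for s in msa]

def remove_worst_offender(msa):
    """Remove the highest-scoring sequence, then drop gap-only columns.
    Returns the shortened alignment and the index of the removed sequence."""
    worst = max(range(len(msa)), key=lambda i: gap_inducing_score(msa, i))
    remaining = [s for i, s in enumerate(msa) if i != worst]
    return drop_gap_only_columns(remaining), worst

# Toy alignment: the third sequence carries an insertion that forces
# two gap columns into the other two sequences.
toy = ["AC--GT", "AC--GT", "ACTTGT"]
compacted, removed = remove_worst_offender(toy)
```

In this toy alignment the third sequence is the only one with residues in columns 3 and 4, so removing it lets both columns be dropped and the alignment length falls from 6 to 4.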
Download
Most current version (might be unstable): forester.jar
Source code is available at GitHub: https://github.com/cmzmasek/forester
Usage
java -cp path\to\forester.jar org.forester.application.msa_compactor <options> <msa input file> <output file base>
options:
-r=<integer> number of worst offender sequences to remove
-l=<integer> target MSA length
-g=<decimal> target gap-ratio (0.0-1.0)
-a to realign using MAFFT (if path to MAFFT not found, use "-mafft=<path to MAFFT>" option)
-mo=<string> options for MAFFT (default: "--auto")
-s=<integer> step for output and re-aligning (default: 1)
-sd=<integer> step for diagnostics reports (default: 1)
-e to calculate normalized Shannon Entropy (for very large alignments, use -sd option to
speed up calculations)
-p to write output alignments in phylip format instead of fasta
-ro=<file> to output the removed sequences
-t to calculate an approximate phylogenetic tree of the original alignment
[Neighbor Joining on Kimura's distances (Kimura, 1983)]
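The two diagnostics named in the options above, the gap ratio targeted by -g and the normalized Shannon entropy reported by -e, can be sketched as follows. This is a hedged illustration and not forester's code: it assumes the gap ratio is the fraction of all alignment cells that are gaps, and that entropy is computed per column over non-gap residues, divided by the logarithm of an assumed number of states (n_states, 20 by default for amino acids) and averaged over columns. The function names are hypothetical.

```python
import math
from collections import Counter

GAP = "-"

def gap_ratio(msa):
    """Fraction of all cells in the alignment that are gaps (assumed definition)."""
    total = len(msa) * len(msa[0])
    gaps = sum(seq.count(GAP) for seq in msa)
    return gaps / total

def normalized_column_entropy(column, n_states=20):
    """Shannon entropy of one column, ignoring gaps, divided by log(n_states).
    The normalization by n_states is an assumption for this sketch."""
    residues = [ch for ch in column if ch != GAP]
    if not residues:
        return 0.0  # a gap-only column carries no residue information
    counts = Counter(residues)
    n = len(residues)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n_states)

def mean_normalized_entropy(msa, n_states=20):
    """Average of the normalized column entropies over all columns."""
    values = [normalized_column_entropy(col, n_states) for col in zip(*msa)]
    return sum(values) / len(values)

toy = ["AC-T", "AG-T"]
ratio = gap_ratio(toy)  # 2 gap cells out of 8
```

Under these assumptions a fully conserved column scores 0, and a column in which all n_states residues occur equally often scores 1.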
Examples
Display of chart and phylogenetic tree only, no output produced:
% msa_compactor -e -t bcl2.aln
Display of chart only, no output produced, report after every 10 sequences have been removed:
% msa_compactor -e -sd=10 bcl2.aln
Display of chart only, no output produced, re-aligning with MAFFT after every 40 sequences have been removed (using MAFFT's most accurate settings "--maxiterate 1000 --localpair"):
% msa_compactor -e -s=40 -a -mo="--maxiterate 1000 --localpair" bcl2.aln
Removal of the 20 most "gap-inducing" sequences, use "bcl2_compact" as base for output:
% msa_compactor -e -r=20 bcl2.aln bcl2_compact
Removal of sequences until the total MSA length is less than 200 aa, use "bcl2_compact" as base for output:
% msa_compactor -e -l=200 bcl2.aln bcl2_compact
Removal of sequences until the overall MSA gap ratio is less than 0.5, use "bcl2_compact" as base for output, re-aligning with MAFFT after every 40 sequences have been removed (using MAFFT's auto settings), report after every 5 sequences have been removed:
% msa_compactor -e -g=0.5 -s=40 -a -mo="--auto" -sd=5 bcl2.aln bcl2_compact
Removal of sequences until the overall MSA gap ratio is less than 0.5, use "bcl2_compact" as base for output, write removed sequences to "removed_seqs":
% msa_compactor -e -g=0.5 -ro=removed_seqs bcl2.aln bcl2_compact
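The examples above use "msa_compactor" as shorthand, presumably for the full java command line given under Usage. For scripted runs, that command line could be assembled along these lines. This is a small sketch and not part of forester: the helper name build_command is hypothetical, and it only arranges arguments in the order given under Usage (options, then input file, then output base).

```python
def build_command(jar_path, msa_file, output_base=None, options=()):
    """Assemble the argument list for one msa_compactor run (hypothetical helper).

    The output base is optional because the chart-only examples omit it.
    """
    cmd = ["java", "-cp", jar_path, "org.forester.application.msa_compactor"]
    cmd.extend(options)           # e.g. "-e", "-r=20"
    cmd.append(msa_file)
    if output_base is not None:
        cmd.append(output_base)
    return cmd

# Equivalent of: % msa_compactor -e -r=20 bcl2.aln bcl2_compact
cmd = build_command("forester.jar", "bcl2.aln", "bcl2_compact", ["-e", "-r=20"])
```

The resulting list can be passed to subprocess.run(cmd, check=True). Because no shell is involved, an option whose value contains spaces is given as a single list element without the extra quotes, for example "-mo=--maxiterate 1000 --localpair".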
Last updated: 2015-01-21