msa_compactor


Overview

msa_compactor is a simple command-line tool to "shrink" multiple sequence alignments (MSAs) in length by removing gap-inducing sequences.

msa_compactor is implemented in Java as part of the forester libraries.

Download

Usage

java -cp path\to\forester.jar org.forester.application.msa_compactor <options> <msa input file> <output file base>

options:

-r=<integer> number of worst offender sequences to remove

-l=<integer> target MSA length

-g=<decimal> target gap-ratio (0.0-1.0)

-a to realign using MAFFT (if path to MAFFT not found, use "-mafft=<path to MAFFT>" option)

-mo=<string> options for MAFFT (default: "--auto")

-s=<integer> step for output and re-aligning (default: 1)

-sd=<integer> step for diagnostics reports (default: 1)

-e to calculate normalized Shannon Entropy (for very large alignments, use the -sd option to speed up calculations)

-p to write output alignments in phylip format instead of fasta

-ro=<file> to output the removed sequences

-t to calculate an approximate phylogenetic tree of the original alignment [Neighbor Joining on Kimura's distances (Kimura, 1983)]
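The distance measure named for the -t option can be illustrated with a short Python sketch. It uses Kimura's approximation for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the fraction of differing residues. How forester itself treats gapped columns is not stated here; skipping any column where either sequence has a gap is an assumption, and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gap_chars="-."):
    """Fraction of differing residues over columns where neither sequence has a gap.

    Assumption: gapped columns are skipped (pairwise deletion).
    """
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a.upper() != b.upper():
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns shared by the two sequences")
    return differing / compared


def kimura_distance(seq_a, seq_b):
    """Kimura's approximation: d = -ln(1 - p - 0.2 * p^2)."""
    p = p_distance(seq_a, seq_b)
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The correction is undefined for very divergent pairs.
        return math.inf
    return -math.log(inner)
```

A matrix of such pairwise distances is the input that a Neighbor Joining implementation would then turn into a tree.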

Examples

Display of chart and phylogenetic tree only, no output produced:

% msa_compactor -e -t bcl2.aln

Display of chart only, no output produced, report after every 10 sequences have been removed:

% msa_compactor -e -sd=10 bcl2.aln
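As a rough illustration of what the -e option reports, the following Python sketch computes a normalized Shannon entropy for an alignment. The exact normalization used by msa_compactor is not documented here, so several choices are assumptions: gaps are ignored, each column's entropy is divided by the logarithm of the alphabet size (20 for amino acids), and the alignment value is the mean over columns. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, alphabet_size=20, gap_chars="-."):
    """Shannon entropy of one column, scaled to 0..1 by log(alphabet_size).

    Assumption: gap characters are ignored; an all-gap column scores 0.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(alphabet_size)


def mean_normalized_entropy(msa, alphabet_size=20):
    """Mean of the per-column normalized entropies of an alignment."""
    values = [column_entropy(col, alphabet_size) for col in zip(*msa)]
    if not values:
        raise ValueError("alignment has no columns")
    return sum(values) / len(values)
```

Because every column is visited each time a report is produced, reporting less often (a larger -sd value) reduces the work on very large alignments.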

Display of chart only, no output produced, re-aligning with MAFFT after every 40 sequences have been removed (using MAFFT's most accurate settings "--maxiterate 1000 --localpair"):

% msa_compactor -e -s=40 -a -mo="--maxiterate 1000 --localpair" bcl2.aln

Removal of the 20 most "gap-inducing" sequences, use "bcl2_compact" as base for output:

% msa_compactor -e -r=20 bcl2.aln bcl2_compact
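One way to think about "gap-inducing" sequences is sketched below in Python. This is not msa_compactor's actual algorithm, which is not described here. It is a hypothetical heuristic: a sequence is penalized for every residue it places in a column where other sequences mostly have gaps, the highest-scoring sequence is removed, and columns left with only gaps are dropped before the next round.

```python
GAP = "-"


def gap_fractions(msa):
    """Fraction of gap characters in each column."""
    n = len(msa)
    return [sum(1 for c in col if c == GAP) / n for col in zip(*msa)]


def offender_scores(msa):
    """Hypothetical score: sum of column gap fractions where the sequence has a residue."""
    fractions = gap_fractions(msa)
    return [sum(f for c, f in zip(seq, fractions) if c != GAP) for seq in msa]


def drop_all_gap_columns(msa):
    """Remove columns that contain only gaps, shortening the alignment."""
    keep = [i for i, col in enumerate(zip(*msa)) if any(c != GAP for c in col)]
    return ["".join(seq[i] for i in keep) for seq in msa]


def remove_worst(msa, count):
    """Greedily remove up to `count` sequences, always keeping at least one."""
    msa = list(msa)
    removed = []
    for _ in range(min(count, len(msa) - 1)):
        scores = offender_scores(msa)
        worst = max(range(len(msa)), key=scores.__getitem__)
        removed.append(msa.pop(worst))
        msa = drop_all_gap_columns(msa)
    return msa, removed
```

In the toy alignment `["AC--DE", "AC--DE", "ACWWDE"]`, the third sequence is the only one with residues in the two mostly-gap columns, so removing it shortens the alignment from 6 to 4 columns.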

Removal of sequences until the total MSA length is less than 200 aa, use "bcl2_compact" as base for output:

% msa_compactor -e -l=200 bcl2.aln bcl2_compact

Removal of sequences until the overall MSA gap ratio is less than 0.5, use "bcl2_compact" as base for output, re-aligning with MAFFT after every 40 sequences have been removed (using MAFFT's auto settings), report after every 5 sequences have been removed:

% msa_compactor -e -g=0.5 -s=40 -a -mo="--auto" -sd=5 bcl2.aln bcl2_compact
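The gap ratio targeted by -g can be pictured with a minimal Python sketch. It assumes the ratio is the number of gap characters divided by the total number of cells in the alignment (sequences times columns), which gives a value between 0.0 and 1.0 as in the option description. The precise definition used by msa_compactor is not given here, and the function name is hypothetical.

```python
def gap_ratio(msa, gap_chars="-."):
    """Overall gap ratio: gap characters divided by all alignment cells.

    Assumption: the ratio is taken over the whole alignment, not per column.
    """
    if not msa:
        raise ValueError("empty alignment")
    length = len(msa[0])
    if length == 0:
        raise ValueError("alignment has no columns")
    if any(len(seq) != length for seq in msa):
        raise ValueError("sequences differ in length")
    gaps = sum(1 for seq in msa for c in seq if c in gap_chars)
    return gaps / (length * len(msa))
```

Under this definition, `gap_ratio(["AC--", "ACDE"])` counts 2 gaps in 8 cells.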

Removal of sequences until the overall MSA gap ratio is less than 0.5, use "bcl2_compact" as base for output, write removed sequences to "removed_seqs":

% msa_compactor -e -g=0.5 -ro=removed_seqs bcl2.aln bcl2_compact
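For scripted runs, the invocation from the Usage section can be assembled in Python. The sketch below only builds the `java -cp ... org.forester.application.msa_compactor` argument list shown there and passes options through unchanged. The helper names are hypothetical, and the location of forester.jar is whatever applies on your system.

```python
import subprocess

MAIN_CLASS = "org.forester.application.msa_compactor"


def build_command(jar_path, msa_file, out_base=None, options=()):
    """Build the argument list: java -cp <jar> <main class> <options> <msa> [<output base>]."""
    cmd = ["java", "-cp", jar_path, MAIN_CLASS]
    cmd.extend(options)
    cmd.append(msa_file)
    if out_base is not None:
        cmd.append(out_base)
    return cmd


def run_compactor(jar_path, msa_file, out_base=None, options=()):
    """Run msa_compactor; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_command(jar_path, msa_file, out_base, options), check=True
    )
```

For instance, `build_command("forester.jar", "bcl2.aln", "bcl2_compact", ["-e", "-g=0.5", "-ro=removed_seqs"])` mirrors the last example above. Passing the arguments as a list avoids shell quoting issues with values such as `-mo=--maxiterate 1000 --localpair`.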

Last updated: 2015-01-21