PASTA/UPP

PASTA

Mirarab, Siavash, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.” Journal of Computational Biology 22, no. 5 (2015): 377–86. doi:10.1089/cmb.2014.0156.
Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment.” Edited by Roded Sharan. Research in Computational Molecular Biology, 2014, 177–91.

PASTA (Practical Alignment using SATé and TrAnsitivity) is an improvement to SATé: it uses some of the algorithmic design of SATé but is faster, produces more accurate alignments and trees, and can scale to much larger datasets. PASTA computes alignments on very large datasets using a divide-and-conquer technique. It divides the dataset into smaller, evolutionarily less diverged subsets, aligns each subset, merges some pairs of these subset alignments to get a set of overlapping and compatible alignments, and finally uses transitivity to merge all these overlapping alignments into a final alignment. The novel transitivity-based merge technique makes PASTA very scalable and also improves its accuracy relative to its predecessor, SATé.
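The transitivity step can be illustrated with a toy sketch. This is not PASTA's implementation: the function names (columns_of, transitive_columns), the dictionary representation of an alignment, and the example sequences are all assumptions made for illustration. It only shows the core idea that residues sharing a column in any of the overlapping alignments are grouped into one final column; ordering the resulting columns is omitted.

```python
from collections import defaultdict

def columns_of(alignment):
    """Yield each column of an alignment as a list of (sequence name, residue index).

    `alignment` maps sequence names to equal-length gapped strings ('-' is a gap).
    This representation is an assumption made for the sketch.
    """
    counters = {name: 0 for name in alignment}
    length = len(next(iter(alignment.values())))
    for col in range(length):
        column = []
        for name, row in alignment.items():
            if row[col] != "-":
                column.append((name, counters[name]))
                counters[name] += 1
        yield column

def transitive_columns(alignments):
    """Group residues that are connected through shared columns (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for alignment in alignments:
        for column in columns_of(alignment):
            for residue in column:
                find(residue)  # register residues, including singletons
            for residue in column[1:]:
                union(column[0], residue)

    groups = defaultdict(set)
    for residue in list(parent):
        groups[find(residue)].add(residue)
    # Sorted only to make the output deterministic; this is not alignment order.
    return sorted(groups.values(), key=min)

# Two overlapping pairwise alignments that share sequence B (hypothetical data).
ab = {"A": "AC-GT", "B": "ACTGT"}
bc = {"B": "ACTGT", "C": "A-TGT"}
for group in transitive_columns([ab, bc]):
    print(sorted(group))
```

In this example sequences A and C are never aligned to each other directly, yet the G of A and the G of C end up in the same group because both are aligned to the G of B.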

Software

The PASTA code is available from GitHub. The README file gives detailed installation instructions, which are straightforward.
To use the Mac package:
1. The .dmg files for the Mac application are available below. Download the latest version of the .dmg file.
2. Open the .dmg file and copy its contents to your preferred destination (do not run PASTA from the image itself).
3. Run the PASTA app from the location you copied it to.

If you have any trouble with this, please go to the PASTA tutorial, at https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md.

A VM image (mostly for Windows users) is available here for download. Note that the VM image is 1.7 GB and can take a long time to download. Once the image is downloaded, you need to run it in a virtual machine environment. If you don't have one, VirtualBox is a good option: it is free and easy to use. Download VirtualBox and install it on your machine, then use File/import to import the Phylolab.ova image that you downloaded.

When importing the VM image, you are given a set of options that you can tweak. The VM image tries to allocate 1 GB of RAM by default. If your machine has 4 GB or more of RAM, that default should be fine. If you have less, you might wish to reduce the memory to something like 512 MB, but that could limit the maximum dataset size you can analyze with PASTA. You can always modify this value later.

Once the VM is imported, you can start it from VirtualBox. If you are asked to log in, use the username phylolab and the password phylolab. PASTA is already installed on the VM, so you can simply open a terminal and run it using run_pasta.py.

Datasets

The Google Drive folder above includes the following files:

The 10 largest AA datasets (which are called "small" because they are smaller than other datasets): small_10_aa.zip.
The HomFam datasets: homfam.zip.
The FastTree COG datasets: cog.zip.
The Indelible 10K datasets: indelible.zip.
The 1000-taxon simulated datasets are available at the SATé-I website.
The three 16S RNA biological datasets (16S.3, 16S.T, and 16S.B.ALL) can be found at this page.
- However, we also used thresholds other than 75% for these datasets. The reference biological datasets without edge contraction can be found in the file guttell-bootstrap.zip.
The RNASim datasets are random subsets of the RNASim dataset created by S. Guo, L.-S. Wang, and J. Kim and described here. The true alignments and tree for our random subsamples are given in the file rnasim.tbz.

Contact: All questions and inquiries should be addressed to our user email group: pasta-users@googlegroups.com

UPP

Nguyen, Nam-phuong D., Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. “Ultra-Large Alignments Using Phylogeny-Aware Profiles.” Genome Biology 16, no. 1 (December 16, 2015): 124. doi:10.1186/s13059-015-0688-z.

UPP (Ultra-large alignments using Phylogeny-aware Profiles) is a new method for the alignment of large and potentially fragmentary datasets. UPP takes as input a set of unaligned sequences and partitions them into a "backbone set" (up to 1,000 sequences) and a "query set". PASTA is used to produce an alignment and tree on the backbone set; these are called the "backbone alignment" and "backbone tree". The sequences in the query set are then added to the backbone alignment using the Ensemble of HMMs technique presented in the paper describing UPP.
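The partitioning step can be sketched as follows. This is a minimal illustration that assumes the backbone is a plain random sample of at most 1,000 sequences; the criteria UPP actually uses to choose backbone sequences are described in the paper and are not reproduced here. The function name split_backbone_query and its parameters are hypothetical.

```python
import random

def split_backbone_query(names, backbone_size=1000, seed=0):
    """Split sequence names into a backbone set and a query set.

    Assumption: the backbone is a uniform random sample of at most
    `backbone_size` names; everything else becomes the query set.
    """
    rng = random.Random(seed)          # seeded for reproducibility
    names = list(names)
    k = min(backbone_size, len(names))
    backbone = set(rng.sample(names, k))
    query = [name for name in names if name not in backbone]
    return sorted(backbone), query

# Hypothetical input: 2,500 sequence names.
all_names = [f"seq{i}" for i in range(2500)]
backbone, query = split_backbone_query(all_names)
print(len(backbone), len(query))
```

In a real run, the backbone sequences would be aligned with PASTA and the query sequences would then be added to that backbone alignment.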

Download

UPP uses the code from SEPP to perform ultra-large alignments, and thus requires SEPP to be installed. The SEPP code and installation instructions are available from GitHub. The README file gives detailed installation instructions, which are straightforward.
UPP also uses PASTA to generate the backbone alignment and tree. PASTA is available from GitHub. Its README file gives detailed installation instructions.
The README for configuring and running UPP is available here.

Data

The UPP paper uses all the datasets from PASTA shown above. In addition, below, we provide:

The fragmentary versions of the ROSE NT, CRW 16S, RNA 10K, and Indelible 10K datasets are available in fragmentary.zip.
An Excel file containing the results for each of the methods that were run on the datasets can be found at results.xls.
Supplementary materials are available in upp_supp.pdf.

All questions should be addressed to Nam-phuong Nguyen (namphuon@illinois.edu), Siavash Mirarab (smirarab@gmail.com), or Tandy Warnow (warnow@illinois.edu).
