Binning papers

Statistical Binning

Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.” Science 346, no. 6215 (December 12, 2014): 1250463–1250463. doi: http://dx.doi.org/10.1126/science.1250463

The statistical binning dataset is permanently archived at UIUC under http://dx.doi.org/10.13012/C5MW2F2P (currently at https://www.ideals.illinois.edu/handle/2142/55319).
Some files are missing from that link and are provided here. The README file shown below gives the details.

README

Weighted Statistical Binning

Bayzid, Md. Shamsuzzoha, Siavash Mirarab, Bastien Boussau, and Tandy Warnow. “Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses.” PLoS ONE 10, no. 6 (June 18, 2015): e0129183. doi:10.1371/journal.pone.0129183.

Most of the datasets used in this study are available through the prior publication (statistical binning).
The new datasets generated for this study are available on figshare, with DOI: http://dx.doi.org/10.6084/m9.figshare.1411146.
The weighted statistical binning software is available on GitHub at https://github.com/smirarab/binning.
Some of the files from figshare are also provided below.

Response to comment on statistical binning

Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Response to Comment on ‘Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.’” Science 350, no. 6257 (October 9, 2015): 171. doi:10.1126/science.aaa7719.

Datasets

The following files are provided:

MCcoal.zip: includes the MCcoal control files and the seeds used
Alignments, true gene trees, and estimated gene trees for the 10-taxon and 15-taxon datasets (the 5-taxon datasets were simulated by Liang Liu and Scott Edwards (L&E) and provided to us):
- Gene alignments: t10_t15_gene_alignments.tar.bz2
- True gene trees: t10_t15_true_gene_trees.tar.bz2
- Estimated gene trees: t10_t15_estimated_genetrees.tar.bz2
Supergene alignments and trees:
- Supergene alignments: for each supergene, the alignment and the partition file are given in supergene_alignments.tar.bz2
- Supergene trees: supergene_trees.tar.bz2
MP-EST results (including the input used for each MP-EST run): MPEST_input_and_output.tar.bz2

Simulation Procedure

Gene trees were simulated using MCcoal, with control files given in our dataset. These control files define the species tree. The species trees are in the caterpillar form (shown below):

10 species:

(((((((((A #.05,B #.05):0.005 #.05,C #.05):0.01 #.05, D #.05):0.015 #.05,E #.05):0.02 #.05,F #.05):0.025 #.05,G #.05):0.03 #.05,H #.05):0.035 #.05,I #.05):0.04 #.05, J #.05):0.54 #.05;

15 species:

((((((((((((((A #.05,B #.05):0.005 #.05,C #.05):0.01 #.05, D #.05):0.015 #.05,E #.05):0.02 #.05,F #.05):0.025 #.05,G #.05):0.03 #.05,H #.05):0.035 #.05,I #.05):0.04 #.05,J #.05):0.045 #.05,K #.05):0.05 #.05,L #.05):0.055 #.05,M #.05):0.06 #.05,N #.05):0.065 #.05,O #.05):0.565 #.05;

To run MCcoal, the following command was used:

printf "10000 1000" | PATH_TO_MCCOAL/MCcoal

This simulated 10,000 gene trees, which we divided into 10 replicates of 1000 genes each, and similarly 10 replicates of 100 genes each.
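The text does not say how individual trees were assigned to replicates. As a minimal sketch, assuming the MCcoal output (out.trees, one Newick tree per line) is cut into contiguous blocks, the division could look like the following. The function name and the output file names are hypothetical and are not part of the original pipeline.

```shell
# split_replicates TREE_FILE GENES_PER_REP OUT_PREFIX
# Writes OUT_PREFIX.1, OUT_PREFIX.2, ... where each file holds
# GENES_PER_REP consecutive lines (trees) of TREE_FILE.
# Assumption: replicates are contiguous blocks; the source does not confirm this.
split_replicates() {
    awk -v n="$2" -v prefix="$3" '
        {
            # Replicate index for this line: lines 1..n -> 1, n+1..2n -> 2, ...
            out = prefix "." (int((NR - 1) / n) + 1)
            # Close the previous replicate file once we move to the next one.
            if (out != prev) { if (prev != "") close(prev); prev = out }
            print > out
        }
    ' "$1"
}
```

For example, `split_replicates out.trees 1000 rep1000` would produce rep1000.1 through rep1000.10 from a 10,000-line file.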

For each true gene tree, we then simulated alignments using bppseqgen, using the following commands:

mkdir allTrees

split -a 4 -l 1 out.trees

for i in x* ; do mv $i allTrees/ ; done

for i in allTrees/x* ; do bppseqgen number_of_sites=1000 input.tree.file=$i param=bpp.options output.sequence.file=$i".fasta" ; done

The file bpp.options is the same as what was used in our statistical binning paper (Mirarab et al., Science 2014):

# Substitution model parameters:

model = GTR(a=1.062409952497, b=0.133307705766, c=0.195517800882, d=0.223514845018, e=0.294405416545, theta=0.469075709819, theta1=0.558949940165, theta2=0.488093447144)

# Rate distribution parameters:

rate_distribution = Gamma(n=4, alpha=0.370209777709)

Estimating gene trees

To estimate gene trees, we used RAxML, version 8.0.19. We used the following commands.

Unbinned gene trees (unpartitioned analyses):
- maximum likelihood analyses:
  - raxmlHPC-8.0.19-SSE3 -m GTRGAMMA -n best -s [alignment_file] -N 10 -p [random_seed_number]
- bootstrapping:
  - raxmlHPC-8.0.19-SSE3 -m GTRGAMMA -n ml -s [alignment_file] -N 100 -b [random_seed_number] -p [random_seed_number]
- drawing support values onto the maximum likelihood tree:
  - raxmlHPC-8.0.19-SSE3 -f b -m GTRGAMMA -n final.f100 -z RAxML_bootstrap.ml -t RAxML_bestTree.best
Supergene trees (partitioned analyses):
- bootstrapping:
  - raxmlHPC-8.0.19-SSE3 -m GTRGAMMA -n ml -s [alignment_file] -N 100 -b [random_seed_number] -p [random_seed_number] -M -q supergene.part

Note that the partition files are provided as part of our dataset.
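The three unbinned steps listed above can be chained for a single alignment. The sketch below uses only the flags shown in those commands. The function name, the RAXML variable, and the use of one seed for both -b and -p are assumptions made for illustration. Because the run names (best, ml, final.f100) are fixed and RAxML refuses to overwrite existing output for a run name, each alignment would need its own working directory.

```shell
# Path to the RAxML binary; override by setting RAXML before calling.
RAXML=${RAXML:-raxmlHPC-8.0.19-SSE3}

# estimate_unbinned ALIGNMENT SEED
# Runs the ML search, the bootstrap, and the support-drawing step in order,
# stopping at the first step that fails.
# Assumption: the same SEED is reused for every step.
estimate_unbinned() {
    aln=$1
    seed=$2
    # Step 1: best of 10 ML searches, written to RAxML_bestTree.best
    "$RAXML" -m GTRGAMMA -n best -s "$aln" -N 10 -p "$seed" || return 1
    # Step 2: 100 bootstrap replicates, written to RAxML_bootstrap.ml
    "$RAXML" -m GTRGAMMA -n ml -s "$aln" -N 100 -b "$seed" -p "$seed" || return 1
    # Step 3: draw bootstrap support onto the ML tree
    "$RAXML" -f b -m GTRGAMMA -n final.f100 -z RAxML_bootstrap.ml -t RAxML_bestTree.best
}
```

Setting `RAXML=echo` before calling the function prints the three argument lists without running anything, which is a cheap way to check them.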

Binning procedure

To perform binning, we used a pipeline available on GitHub. As input to the binning pipeline, we used the RAxML_bipartitions.final files produced as part of our estimation of unbinned gene trees. We varied the bootstrap support threshold (25%, 50%, and 75%). Each supergene tree is computed using a fully partitioned analysis, which means that all parameters other than the tree topology (GTR matrix and branch lengths) are independently estimated for each gene within each supergene concatenated sequence alignment.

For the 75% threshold, we also ran some tests where we investigated the impact of breaking ties in the binning code in multiple ways. To force the binning pipeline to break ties differently from one run to another, we modified the pipeline to create random orderings of gene names that are used as input to the pipeline. Thus, instead of

ls| grep -v ge|sed -e "s/.50$//g"> genes

we used:

ls| grep -v ge|sed -e "s/.50$//g"|sort -R > genes
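Repeating the tie-breaking test several times needs one gene ordering per run. The following is a sketch only, assuming GNU coreutils `shuf` is available; the function name and the output file names are hypothetical. `sort -R` orders lines by a random hash of each line, so identical lines stay adjacent, whereas `shuf` draws a random permutation of the lines. With unique gene names either tool yields a random ordering.

```shell
# random_orderings GENES_FILE N_RUNS OUT_PREFIX
# Writes OUT_PREFIX.1 ... OUT_PREFIX.N_RUNS, each a random permutation
# of the lines (gene names) in GENES_FILE.
# Assumption: GNU shuf is installed; this is not the authors' script.
random_orderings() {
    run=1
    while [ "$run" -le "$2" ]; do
        shuf "$1" > "$3.$run"
        run=$((run + 1))
    done
}
```

For example, `random_orderings genes 5 genes.run` would write genes.run.1 through genes.run.5, each of which could replace the genes file for one run of the pipeline.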

Species tree estimation

To estimate species trees, we used the following steps:

We rooted gene trees (using custom scripts based on DendroPy) on the outgroup (E for 5-taxon, J for 10-taxon, and O for 15-taxon).
We created 100 bootstrap replicate inputs for MP-EST. To do this, we matched the first bootstrap replicate of each input (super-) gene tree to build input 1 (BS.1), matched the second bootstrap replicate to create input 2 (BS.2), and so on. These files (BS.1, BS.2, ..., BS.100) are provided as part of our dataset.
We ran MP-EST on each input (BS.i). We ran MP-EST 10 times on each input and picked the result with the highest pseudo-likelihood value. The final tree from the best run is provided in our datasets.
We estimated the greedy consensus of all 100 bootstrap runs, and used this consensus as the estimate of the species tree.
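The replicate-matching step above can be sketched as follows, assuming each (super-) gene has a file of rooted bootstrap trees with one Newick tree per line, so that line i of every file goes into BS.i. The function name and the .bs file extension are hypothetical; the actual scripts used to build the provided BS.i files are not shown here.

```shell
# build_bs_inputs GENE_DIR N_REPS OUT_DIR
# For i = 1..N_REPS, collects line i (the i-th bootstrap tree) of every
# GENE_DIR/*.bs file into OUT_DIR/BS.i, one tree per line.
# Assumption: one rooted bootstrap-tree file per gene, named <gene>.bs.
build_bs_inputs() {
    gene_dir=$1
    n_reps=$2
    out_dir=$3
    mkdir -p "$out_dir"
    i=1
    while [ "$i" -le "$n_reps" ]; do
        : > "$out_dir/BS.$i"                      # start with an empty file
        for f in "$gene_dir"/*.bs; do
            sed -n "${i}p" "$f" >> "$out_dir/BS.$i"   # append the i-th tree
        done
        i=$((i + 1))
    done
}
```

For example, `build_bs_inputs rooted_bootstraps 100 mpest_inputs` would write mpest_inputs/BS.1 through mpest_inputs/BS.100.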

BBCA

Zimmermann, Théo, Siavash Mirarab, and Tandy Warnow. “BBCA: Improving the Scalability of *BEAST Using Random Binning.” BMC Genomics 15, no. Suppl 6 (October 17, 2014): S11. doi:10.1186/1471-2164-15-S6-S11.

All the archive files in the following list include README files that describe their content.

sim.laura.hom.zip: Simulated Laurasiatheria dataset: This archive includes model tree, fasta sequences, true gene trees, and FastTree and RAxML estimated gene trees.
beast_input_sim.11taxon.zip: *BEAST input files - 11-taxon dataset
beast_input_sim.laura.hom.zip: *BEAST input files - Laura. dataset
create_beast_input.pl: *BEAST Script: a script used for building *BEAST input files.
