eMPRess

What is eMPRess

eMPRess is a software tool for reconciling pairs of phylogenetic trees such as host-parasite, host-symbiont, and species-gene trees under the Duplication-Transfer-Loss (DTL) model. The eMPRess tool was developed at Harvey Mudd College and is the successor to our Jane reconciliation tool. eMPRess has many features that are based on new and efficient algorithms. Read more about those features below.

eMPRess takes two undated binary phylogenetic trees (e.g., host and parasite, host and symbiont, species and gene) and an association of their tips as input. eMPRess addresses several important issues that are generally not supported in other existing tools. Among them are:

  1. Choosing the event costs is notoriously difficult. Different choices of the costs for duplication, transfer, and loss events can give rise to very different reconciliations and, consequently, very different conclusions. eMPRess helps guide the user in selecting event costs by computing and displaying "event cost regions" that show the different choices of event costs and their impacts on the resulting solutions. This feature allows users to systematically explore the space of event costs.

  2. The number of maximum parsimony reconciliations (MPRs), even for a fixed set of event costs, can be extremely large (e.g., in the billions or more, even for trees with several tens of tips). A difficult problem, therefore, is selecting one or more MPRs that best represent the potentially huge solution space. eMPRess provides tools for visualizing the space of MPRs, clustering that space into "similar" MPRs, and finding a best representative MPR in each cluster.

  3. eMPRess computes support values for each event in each reconciliation that it displays. The support value of an event is the fraction of MPRs that contain that event. eMPRess computes these support values exactly rather than by sampling.

  4. eMPRess maintains and expands on many features in Jane, including both a graphical user interface and a command line interface, visualizations of reconciliations, and the ability to save these visualizations as high-quality images for use in publication.
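
To make the definition of a support value in item 3 concrete, here is a minimal Python sketch. It enumerates a tiny, hand-made set of MPRs and counts how often each event appears. The event tuples and the function name event_support are hypothetical, and this brute-force enumeration is for illustration only; it is not how eMPRess computes support values.

```python
from collections import Counter

def event_support(mprs):
    """Return {event: fraction of the given MPRs that contain that event}."""
    counts = Counter(event for mpr in mprs for event in set(mpr))
    return {event: n / len(mprs) for event, n in counts.items()}

# Toy solution space (made up): each MPR is a set of
# (parasite_node, host_node, event_type) tuples.
toy_mprs = [
    {("p1", "h1", "cospeciation"), ("p2", "h3", "transfer")},
    {("p1", "h1", "cospeciation"), ("p2", "h2", "duplication")},
    {("p1", "h1", "cospeciation"), ("p2", "h3", "transfer")},
    {("p1", "h2", "loss"), ("p2", "h3", "transfer")},
]

support = event_support(toy_mprs)
for event, value in sorted(support.items()):
    print(event, value)
```

An event with support 1.0 appears in every MPR, while an event with low support appears in only a few of them.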


eMPRess video - tutorial and use case

This 20-minute video tutorial provides a brief primer on the reconciliation problem and demonstrates the eMPRess workflow and functionality.

A touch of theory (recommended before getting started)

Maximum Parsimony Reconciliation

eMPRess, Jane, and most other reconciliation tools use a maximum parsimony approach for finding a "best" mapping of the parasite/symbiont tree onto the host tree. In this formulation, each type of event (duplication, transfer, and loss) has a non-negative cost specified by the user. The objective is to find a reconciliation that minimizes the total cost, that is, the sum over event types of the number of events of that type multiplied by its event cost. Cospeciation is considered a "null event" and therefore has its cost preset to zero.
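
The objective can be written down in a few lines of Python. This is only a sketch of the cost formula, not of the search that eMPRess performs; the function name, the candidate reconciliations, and the event counts are made up for illustration.

```python
def reconciliation_cost(counts, costs):
    """Total cost: number of events of each type times the cost of that type.

    Cospeciations are not counted because their cost is fixed at zero.
    """
    return sum(counts[event] * costs[event]
               for event in ("duplication", "transfer", "loss"))

# Illustrative event costs chosen by the user.
costs = {"duplication": 1, "transfer": 2, "loss": 1}

# Two hypothetical candidate reconciliations, summarized by their event counts.
candidates = {
    "A": {"duplication": 2, "transfer": 1, "loss": 3},
    "B": {"duplication": 0, "transfer": 3, "loss": 0},
}

# The more parsimonious candidate is the one with the lower total cost.
best = min(candidates, key=lambda name: reconciliation_cost(candidates[name], costs))
print(best, reconciliation_cost(candidates[best], costs))
```

With these costs, candidate A costs 2*1 + 1*2 + 3*1 = 7 and candidate B costs 3*2 = 6, so B is preferred. Raising the transfer cost would eventually reverse that preference.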

Event Costs

Event costs are notoriously difficult to estimate. Many tools have default event costs (e.g., Jane's defaults are 1, 2, and 1 for duplication, transfer, and loss, respectively) and studies are often performed using just the default values. However, different event costs can lead to different solutions and thus different conclusions. For example, if one event has a much lower cost than others, a maximum parsimony reconciliation is likely to favor solutions with more of those kinds of events.

eMPRess's "View cost space" feature uses a technique called Pareto-optimal event counts to show you the impact of different event costs and to allow you to select event costs in a principled and systematic way. Specifically, note that event costs are just relative amounts; there is no intrinsic meaning to a unit of cost, so choosing duplication, transfer, and loss costs of 2, 3, and 1, respectively, is the same as choosing costs of 200, 300, and 100; the ratios of the costs are the same in both cases. The "View cost space" feature in eMPRess fixes the cost of a loss at 1.0 and then examines the range of costs of duplication and transfer events relative to this cost of 1 for losses. (Recall that cospeciation is a null event and thus has a fixed cost of zero.)
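
The following Python sketch illustrates why only cost ratios matter: dividing all three costs by the loss cost maps both triples from the paragraph above to the same point. The function name normalize_costs is hypothetical and is not part of eMPRess.

```python
from fractions import Fraction

def normalize_costs(duplication, transfer, loss):
    """Rescale event costs so that a loss costs exactly 1.

    Returns the (duplication, transfer) costs relative to the loss cost.
    """
    if loss <= 0:
        raise ValueError("loss cost must be positive to normalize by it")
    return (Fraction(duplication) / Fraction(loss),
            Fraction(transfer) / Fraction(loss))

# Both triples describe the same relative costs.
print(normalize_costs(2, 3, 1))
print(normalize_costs(200, 300, 100))
```

Exact fractions are used here so that the two results compare equal without floating-point rounding.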

The plot that is displayed by "View cost space" divides up the duplication and transfer cost space into color-coded regions. For any combination of costs in the same region, we will get the same set of MPRs. In other words, in a given color-coded region, it suffices to choose just one point - that is, one combination of costs. For example, see this figure, which shows the event cost regions for the gopher-louse dataset.
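
As a rough illustration of how such regions arise, the Python sketch below takes three made-up event count vectors and reports which one is cheapest at a given point in the cost space, with the loss cost fixed at 1. Points that select the same vector fall in the same region. The count vectors and the function name are assumptions for illustration; eMPRess derives the actual regions from the input trees.

```python
# Hypothetical Pareto-optimal event counts: (duplications, transfers, losses).
PARETO_COUNTS = [(0, 3, 0), (1, 2, 2), (3, 0, 8)]

def cheapest_counts(dup_cost, transfer_cost, loss_cost=1.0):
    """Return the event count vector with the lowest total cost at this point."""
    def total(counts):
        d, t, l = counts
        return d * dup_cost + t * transfer_cost + l * loss_cost
    return min(PARETO_COUNTS, key=total)

# Sample a few points in (duplication cost, transfer cost) space.
for dup_cost, transfer_cost in [(1.0, 1.0), (0.5, 1.0), (1.0, 3.5), (1.0, 5.0)]:
    print(dup_cost, transfer_cost, cheapest_counts(dup_cost, transfer_cost))
```

The first two sample points select the same count vector, so in this toy setting they lie in the same region; the last two points each fall in a different region.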

Dated versus Undated Trees

eMPRess, Jane, and many other tools assume that the trees are undated. That is, while branch lengths may be provided in the newick input files, they are not used in the reconciliation process. Branch lengths - if given - are not assumed to correspond to actual dates when speciation events occurred in the host and parasite/symbiont trees.

Time-Consistency

While a parent node in a tree clearly occurred before its children, the order of the two children is assumed not to be known. In general, the order of nodes that are not ancestrally related to one another is not known. Consider a reconciliation of a parasite/symbiont tree onto a host tree and consider any particular parasite/symbiont species node p. That node p is mapped by the reconciliation to some host node h (or, perhaps, to the edge terminating at h). Clearly, no descendant of p should be mapped by the reconciliation to an ancestor of h. Any reconciliation that satisfies this condition is said to be weakly time-consistent.
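
The condition above can be checked directly. The Python sketch below assumes that each tree is given as a child -> parent dictionary and that the reconciliation is a dictionary from every parasite node to a host node; these data structures and function names are hypothetical and do not reflect eMPRess internals.

```python
def ancestors(node, parent):
    """Yield the proper ancestors of node by walking up a child -> parent dict."""
    while node in parent:
        node = parent[node]
        yield node

def violates_condition(parasite_parent, host_parent, mapping):
    """True if some descendant of p is mapped to an ancestor of p's host."""
    for p in mapping:
        for q in ancestors(p, parasite_parent):  # q is an ancestor of p
            if mapping[p] in ancestors(mapping[q], host_parent):
                return True
    return False

# Toy host tree: h_root -> (h_a, h_b), h_a -> (h1, h2).
host_parent = {"h_a": "h_root", "h_b": "h_root", "h1": "h_a", "h2": "h_a"}
# Toy parasite tree: p_root -> (p1, p2).
parasite_parent = {"p1": "p_root", "p2": "p_root"}

good = {"p_root": "h_a", "p1": "h1", "p2": "h2"}
bad = {"p_root": "h_a", "p1": "h_root", "p2": "h2"}  # p1 sits above its parent's host

print(violates_condition(parasite_parent, host_parent, good))
print(violates_condition(parasite_parent, host_parent, bad))
```

In the second mapping, p1 is a descendant of p_root but is placed on h_root, an ancestor of the host h_a of p_root, so the condition is violated.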

Why "weak"? There is another constraint that we also wish to satisfy and this one has to do with transfer (aka host switch) events and the fact that the trees are undated. When a transfer event occurs involving a parasite node p, one of its children, say p', is transferred to a branch in the host tree that is not ancestrally related (that is, not an ancestor nor a descendant) to the host on which p is mapped. We say that p "takes off" from the host branch on which it resides and that p' "lands" on a branch somewhere else on the tree. The place where p' lands is called the "landing site."

Because the tree is not dated, we don't know if the landing site is contemporaneous with the take-off site. In theory, the take-off and landing sites should be contemporaneous, but there's no way to know for sure. We say that a reconciliation is strongly time-consistent if it is not only weakly time-consistent but there also exists some ordering of the internal nodes of the host tree that guarantees that, for every transfer event, the take-off and landing sites are contemporaneous.

Ideally, we would like strongly time-consistent reconciliations. Here is some good news and bad news: Even finding weakly time-consistent maximum parsimony reconciliations is computationally intractable (NP-hard). Jane uses a heuristic that only considers strongly time-consistent reconciliations, but doesn't guarantee that they are truly maximum parsimony reconciliations (i.e., their total event costs may be higher than optimal). eMPRess, and most other tools, use much faster exact algorithms that do guarantee maximum parsimony but with the possibility that the resulting reconciliations are not time-consistent. eMPRess, however, checks each solution that it finds and indicates whether it is strongly time-consistent (the best outcome), weakly time-consistent, or not even weakly time-consistent.

Dealing with many MPRs

The number of MPRs for a given dataset and a fixed set of event costs can be huge. In some datasets that we have explored, there have been more than 10^50 (1 with 50 zeros after it) MPRs. Nguyen et al. have proposed computing a median MPR in such cases. The median is an MPR that is, roughly speaking, in the "middle" of the space of MPRs and is thus a plausibly good representative. More precisely, the distance between two MPRs is the number of events in which they disagree and a median MPR is one that minimizes the total distance to all other MPRs.
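
The definitions of distance and median can be sketched in Python on a toy example. Here each MPR is represented as a set of event labels, and "disagree" is interpreted as the symmetric difference of the two event sets; both the representation and that interpretation are assumptions for illustration. The brute-force search over all MPRs is only feasible for tiny examples and is not how eMPRess finds medians.

```python
def distance(mpr_a, mpr_b):
    """Number of events that appear in exactly one of the two MPRs."""
    return len(mpr_a ^ mpr_b)

def medians(mprs):
    """Return every MPR that minimizes the total distance to all the others."""
    totals = [sum(distance(m, other) for other in mprs) for m in mprs]
    best = min(totals)
    return [m for m, total in zip(mprs, totals) if total == best]

# Toy MPRs (made up): each is a set of event labels.
m1 = frozenset({"a", "b", "c"})
m2 = frozenset({"a", "b", "d"})
m3 = frozenset({"a", "e", "f"})
toy_mprs = [m1, m2, m3]

print(distance(m1, m2))
print(medians(toy_mprs))
```

In this toy example, m1 and m2 each have total distance 6 to the other MPRs and m3 has total distance 8, so m1 and m2 are tied as medians.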

In general, there's not just one median. For example, consider the numbers 1, 2, 3, 4. Both 2 and 3 are medians. In higher-dimensional spaces (such as the space of all MPRs), there can be many medians - in fact a huge number of medians. But, a median is still presumably more representative than a completely random MPR. Thus, in "View reconciliations", if "One MPR" is selected, eMPRess chooses a random median. Since there are many medians, in general, you won't necessarily see the same MPR each time you do this!

There's another useful feature in the event that the number of MPRs is large. That option is to cluster the space of MPRs into groups based on similarity. In the "View solution space" pull-down menu, choose "Clusters". A window pops up to allow you to enter the number of clusters that you desire to construct (which can be any number between 1 - which means no clustering - and the total number of MPRs). In our experience, 2 or 3 clusters is generally sufficient. Then, eMPRess uses a clustering algorithm that clusters MPRs according to their distance from one another, using the distance measure described above. eMPRess displays a histogram of the distances between all pairs of MPRs in the first row, the distances between all MPRs within each of the two clusters in the second row, and so forth up to the maximum number of clusters that you've specified.

Finally, in "View Reconciliations", you can choose "One per cluster", which will display one randomly selected median reconciliation in each cluster.

This set of features provides a systematic way to find best representative sets of MPRs when the space of MPRs is too large to be adequately represented by a single MPR.

Download and Install eMPRess

Software license information

eMPRess Software

Copyright (C) 2020 Libeskind-Hadas Research Group, Harvey Mudd College

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details: https://www.gnu.org/licenses.

Register (Optional)

If you would like to be notified of updates or announcements regarding eMPRess, please complete this form.

One-click installation for the eMPRess GUI

If you plan to exclusively use the graphical user interface version of eMPRess, you may be able to perform a quick-and-easy one-click install. If this installer doesn't work on your platform, please use the Install empress from GitHub instructions below.

MacOS (requires Catalina 10.15.x)

  1. Go to the latest release on Empress GitHub Releases page and download the zip file named macos-empress-app.

  2. Double click the downloaded file to extract the zipped application into the same folder as the download. An application named empress will be installed. If you opened the zip file on your Desktop, you'll see the empress application on your Desktop.

  3. Double click on the empress application.

  4. MacOS has a security system installed that will not allow you to open applications from untrusted sources. It will politely insist that you move the application to the trash. Please don't move the application to the trash. Instead, open Security & Privacy in System Preferences and you will see there is a button that says Open Anyway. Click on that button and it will open the empress GUI application.

Linux

  1. Go to the latest release on Empress GitHub Releases page and download the zip file named linux-empress-app.

  2. Extract the whole directory to a location of your choice. You can do this by right-clicking on the folder and pressing Extract Here. It is important that you extract the whole directory and not just the executable file.

  3. Enter the directory and find a file named empress_gui. Right-click on that file, and click Properties, then select the Permissions tab. Then, check the box that says Allow executing file as a program.

  4. Double click on empress_gui to run empress.

Windows

  1. Go to the latest release on Empress GitHub Releases page and download the zip file named windows-empress-app.

  2. Extract the whole directory to a location of your choice. You can do this by right-clicking the folder and pressing Extract All. It is important that you extract the whole directory and not just the executable file.

  3. Enter the directory and find a file named empress_gui. Double click on that file.

  4. Windows might prevent you from running the application, saying Windows protected your PC. Click on More Info and then click Run Anyway. Windows might take some time to scan the application for viruses. After it finishes scanning for viruses, the application will automatically open.

Install eMPRess from GitHub

Install Python

eMPRess uses Python version 3.7. You can check whether you have python3.7 installed on your computer by typing the following command at the command line (aka terminal):

python3.7 --version

If the command gives you a python version (e.g., 3.7.8) it means you have python3.7 installed. If it says "command not found", please follow the steps below to install python3.7.

If you don't have python3.7 installed on your computer, you can download the installer from the python.org website. For macOS, choose the macOS 64-bit installer. For Windows, choose the Windows x86-64 executable installer. There is currently no python 3.7 installer for Linux on python.org. Alternatively, you can download python3.7 from the Anaconda website or from your favorite package manager tool such as apt-get.

Install pipenv package manager

Empress uses pipenv as its package manager to install dependencies. You can install pipenv using the terminal and typing in the command below:

pip3 install pipenv

or

python3 -m pip install pipenv

On some systems (notably Linux), pipenv may be installed locally, in which case you'll need to add it to your path.

Download empress repository

If you have git installed, you can download empress by typing in the terminal

cd folder-you-want-to-install # go to the folder you want to download empress to

git clone https://github.com/ssantichaivekin/empress.git

If you don't have git installed, you can click on this link to download the zip file. Name the zip file empress instead of empress-master and unzip it into the location of your choice.

Use pipenv to install dependencies

In the terminal, run the following (the # and the text afterward is just explanatory prose for each command):

cd empress # go to the empress folder you downloaded from last step

pipenv install # create virtual environment and install dependencies

pipenv shell # enter the virtual environment with dependencies installed

Each time you restart the terminal, make sure you run pipenv shell before running the empress script.

To start the GUI, type

python empress_gui.py

Please see the documentation for details on running both the GUI and the CLI.

Sample data

This zip file contains four sample datasets, each comprising a host, parasite/symbiont, and mapping (mapping of the tips of the two trees).

Fig-wasp dataset from Weiblen GD and Bush GW, Speciation in fig pollinators and parasites. Molecular Ecology 2002, 11, 1573-1578.

Gopher-louse dataset from Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259.

Seabird-louse dataset from Paterson AM, Wallis GP, Wallis LJ, Gray RD, Seabird louse coevolution: complex histories revealed by 12S rRNA sequences and reconciliation analyses. Systematic Biology 2000, 49, 383-399.

Finches and brood parasites from Sorenson MD, Balakrishnan CN, Payne RB, Clade-limited colonization in brood parasitic finches (Vidua spp.). Systematic Biology 2004, 53, 140-153.

Documentation

Input Files

Three files are required as input: The host tree, the parasite (symbiont) tree, and a tip mapping.

The host and parasite trees must be in newick format and have the extension .nwk. These trees can have branch length information, but eMPRess ignores it. However, species names should have no whitespace in them. The newick standard used here is that whitespace should be replaced with an underscore symbol. For example, Diomedea epomophora should instead be written as Diomedea_epomophora.

The mapping is a text file that ends with the extension .mapping and specifies the association of the tips of the parasite tree to the tips of the host tree. Each line in the file is of the form:

parasiteTipName : hostTipName

Note that this mapping must associate each parasite tip with at most one host tip. It is fine for a parasite tip not to be mapped to any host tip, but a parasite tip cannot be mapped to more than one host tip. Similarly, it is fine for a host tip not to be mapped from any parasite tip. Finally, it's fine for multiple parasite tips to be mapped to the same host tip.
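
The rules above can be checked before running eMPRess. The Python sketch below parses lines of the form shown above and rejects a parasite tip that appears more than once. The function name parse_mapping and the tip names are hypothetical, and skipping blank lines is an assumption; this is not the parser that eMPRess uses.

```python
def parse_mapping(lines):
    """Parse 'parasiteTipName : hostTipName' lines into {parasite_tip: host_tip}."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        parasite, sep, host = line.partition(":")
        parasite, host = parasite.strip(), host.strip()
        if not sep or not parasite or not host:
            raise ValueError(f"line {number}: expected 'parasiteTipName : hostTipName'")
        if parasite in mapping:
            # A parasite tip may be associated with at most one host tip.
            raise ValueError(f"line {number}: parasite tip {parasite!r} is mapped more than once")
        mapping[parasite] = host
    return mapping

# Several parasite tips may share a host tip; that is allowed.
example = [
    "louse_A : gopher_1",
    "louse_B : gopher_1",
    "louse_C : gopher_2",
]
mapping = parse_mapping(example)
print(mapping)
```

To check a real file, pass an open file object, for example parse_mapping(open("my_data.mapping")), where my_data.mapping is a placeholder file name.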

Running eMPRess through the Graphical User Interface

Documentation on running eMPRess through the GUI is available here.

Running eMPRess through the Command Line Interface

Documentation on running eMPRess through the CLI is available here.

Credits and citing eMPRess

Many people contributed to the development of eMPRess, both in the development of the algorithms and the implementation of the software tool.

If you use eMPRess in your work, please cite

"eMPRess: A Systematic Cophylogeny Reconciliation Tool" by S. Santichaivekin, Q. Yang, J. Liu, R. Mawhorter, J. Jiang, T. Wesley, Y-C. Wu, and R. Libeskind-Hadas, in preparation.

The algorithms employed in eMPRess were published in these papers:

"Pareto-Optimal Phylogenetic Tree Reconciliation" by R. Libeskind-Hadas, Y-C Wu, M. Bansal, and M. Kellis, Bioinformatics, Volume 30, Issue 12, 15 June 2014, Pages i87–i95, https://doi.org/10.1093/bioinformatics/btu289

"An Efficient Exact Algorithm for Computing All Pairwise Distances Between Reconciliations in the Duplication-Transfer-Loss Model" by S. Santichaivekin, R. Mawhorter, and R. Libeskind-Hadas, BMC Bioinformatics, 2019 Dec 17;20(Suppl 20):636. doi: 10.1186/s12859-019-3203-9

"Hierarchical Clustering of Maximum Parsimony Reconciliations" by R. Mawhorter and R. Libeskind-Hadas, BMC Bioinformatics, 26 Nov 2019, 20(1):612 DOI: 10.1186/s12859-019-3223-5

The eMPRess code base was developed by S. Santichaivekin, R. Mawhorter, J. Liu, Q. Yang, J. Jiang, T. Wesley, Y-C Wu, and R. Libeskind-Hadas with additional contributions by C. Ngo, P. Andrews, S. Sehra, Adrian Garcia, Alberto Garcia, D. Makhervaks, and Z. Witzel.


FAQ


Feedback, known issues, reporting bugs, etc.

The current version of eMPRess (version 1.0) has some known limitations or bugs listed below. If you find others, or would like to give us feedback or suggestions, please complete this feedback form.

Here are some issues that we're aware of in the current version of eMPRess:

The development of eMPRess was supported by grant 1905885 from the National Science Foundation to Harvey Mudd College.