The third week features a course by Piotr Zwiernik on Latent tree models and a number of contributed talks. The main webpage of the program is here.
The course will take place at the Department of Economics and Business, Universitat Pompeu Fabra:
Mercè Rodoreda building
C/ Ramon Trias Fargas 25-27
08005 Barcelona
We will meet in room 24.021.
There are numerous lunch options in the neighborhood. The main three options are:
Predicting transcription factor binding sites using phylogenetic footprinting
The computational inference of DNA binding motifs from binding sites is one of the classic tasks in bioinformatics and a prerequisite for understanding gene regulation as a whole. Owing to advances in sequencing technologies and the growing number of available genomes, approaches based on phylogenetic footprinting are becoming increasingly attractive. Phylogenetic footprinting requires phylogenetic trees with attached substitution probabilities to quantify the evolution of binding sites, but these trees and substitution probabilities are typically unknown and cannot be estimated easily.
In this talk I describe joint work carried out over the past few years with the group of Prof. Grosse at the University of Halle. Our main results show that:
* Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites.
* The classification performance of phylogenetic footprinting surprisingly increases with increasing substitution probabilities and is often highest for unrealistically high substitution probabilities close to one. This finding suggests that realistic model assumptions do not always yield optimal predictions.
* Motifs published in databases and in the literature are artificially sharpened compared to the native motifs. These findings also indicate that our current understanding of transcriptional gene regulation might be blurred, but that it is possible to advance this understanding by taking into account the inter-species information available today, and even more so in the future.
The rank of positive-semidefinite approximation of low rank matrices and its application in Phylogenetics
In many areas of applied linear algebra, it is necessary to work with matrix approximations. A common situation arises when a matrix obtained from experimental or simulated data needs to be approximated by a matrix that lies in a statistical model or satisfies some specific properties. In this talk we will study symmetric and positive-semidefinite approximations (see [2]) and we will show that the positive and negative indices of inertia of the symmetric approximation, and the rank of the positive-semidefinite approximation, are always bounded from above by the rank of the original matrix.
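The bound stated above can be checked numerically on a toy example. The following is only a minimal sketch, assuming the Frobenius norm and the construction in [2] (symmetrize, then set negative eigenvalues to zero); the function names `nearest_psd` and `inertia`, the random test matrix, and the tolerance are illustrative assumptions, not part of the talk.

```python
import numpy as np

def nearest_psd(a):
    """Return the symmetric part of `a` and its nearest PSD matrix
    in the Frobenius norm (symmetrize, then clip negative eigenvalues)."""
    sym = (a + a.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    psd = (eigvecs * clipped) @ eigvecs.T
    return sym, psd

def inertia(sym, tol=1e-9):
    """Count positive and negative eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(sym)
    return int((eigvals > tol).sum()), int((eigvals < -tol).sum())

# Hypothetical test data: a random, non-symmetric 6 x 6 matrix of rank 2.
rng = np.random.default_rng(0)
n, k = 6, 2
a = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))

sym, psd = nearest_psd(a)
pos, neg = inertia(sym)
rank_a = np.linalg.matrix_rank(a)
rank_psd = np.linalg.matrix_rank(psd, tol=1e-9)

print("rank of original matrix:", rank_a)
print("inertia of symmetric approximation (pos, neg):", (pos, neg))
print("rank of PSD approximation:", rank_psd)
```

Note that the symmetric approximation itself may have rank larger than that of the original matrix; the bound concerns each index of inertia separately.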
We will show how these results can be applied when using algebraic tools in phylogenetic reconstruction. One main goal of phylogenetic reconstruction is to recover the ancestral relationships among a group of current species. To reconstruct phylogenetic trees, it is usual to model evolution with a parametric statistical model, which allows us to define a joint probability distribution at the leaves of the trees. When these models are algebraic, one can deduce polynomial relationships between these probabilities, known as phylogenetic invariants. One can study these polynomials and the geometry of the algebraic varieties that arise from them, and use them to reconstruct phylogenetic trees.
There is a special situation in which these theoretical probabilities can be placed into a matrix that has to be positive-semidefinite of low rank, say $k$, in order to correspond to a distribution arising from a hidden Markov process on a certain phylogenetic tree (see Proposition 4.5 in [1]). The corresponding $(k+1)$-minors are phylogenetic invariants, and their vanishing provides interesting information about the tree topology. Unfortunately, these conditions and polynomial relationships are not always satisfied when working with real or simulated data.
The aim of our research is to use phylogenetic invariants and the stochasticity of the parameters of the general Markov model (e.g. the aforementioned rank constraints) to provide insight into phylogenetic inference and to design new methods for phylogenetic reconstruction.
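To illustrate why such rank conditions fail on data, here is a small sketch on a generic matrix. It is not a phylogenetic flattening: the matrix, the noise level, and the helper `max_abs_minor` are hypothetical choices made only for illustration. For an exact positive-semidefinite matrix of rank $k$ all $(k+1)$-minors vanish up to rounding, while a small symmetric perturbation makes them nonzero.

```python
from itertools import combinations

import numpy as np

def max_abs_minor(m, size):
    """Largest absolute value among all size x size minors of m."""
    n_rows, n_cols = m.shape
    best = 0.0
    for rows in combinations(range(n_rows), size):
        for cols in combinations(range(n_cols), size):
            best = max(best, abs(np.linalg.det(m[np.ix_(rows, cols)])))
    return best

# Hypothetical data: an exact PSD matrix of rank k and a noisy copy of it.
rng = np.random.default_rng(1)
n, k = 5, 2
b = rng.standard_normal((n, k))
exact = b @ b.T
noise = rng.standard_normal((n, n))
noisy = exact + 1e-2 * (noise + noise.T) / 2.0

exact_minor = max_abs_minor(exact, k + 1)
noisy_minor = max_abs_minor(noisy, k + 1)

print("largest (k+1)-minor, exact matrix:", exact_minor)
print("largest (k+1)-minor, noisy matrix:", noisy_minor)
```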
Acknowledgments. I wish to express my gratitude to my Ph.D. advisors Marta Casanellas and Jesús Fernández-Sánchez who suggested this problem and provided knowledge and expertise that greatly supported this research.
[1] E. S. Allman, J. A. Rhodes, and A. Taylor, A semialgebraic description of the general Markov model on phylogenetic trees, SIAM Journal on Discrete Mathematics, 28, 736-755, 2014.
[2] N. J. Higham, Computing a nearest symmetric positive semidefinite matrix, Linear Algebra and its Applications, 103, 103-118, 1988.
ACPhyl-lecture1-trees.pdf
ACPhyl-lecture2-models.pdf
ACPhyl-lecture3-inference.pdf
ACPhyl-lecture4-estimation.pdf
ACPhyl-lecture5-submodels.pdf