Poly Hannah da Silva

Comparative Genomics and Phylogenetics

Various combinatorial problems emerge from the analysis of evolutionary events and ancestral relations when comparing two or more genomes. Some models in comparative genomics consider only organizational operations such as inversions, translocations, fusions and fissions. All these rearrangements can be generically represented as a Double-Cut-and-Join (DCJ) operation. Other models also include content-modifying operations such as the insertion or deletion of a piece of DNA, called an indel, or the substitution of one piece of DNA by another. For some species (e.g. bacteria like Rickettsia), it has been observed that content-modifying operations occur more frequently than rearrangements during the course of evolution, while in others genome rearrangements are more prevalent. In a series of papers (da Silva et al. 2012a, 2012b, 2013, 2016), we generalized the DCJ-indel and the DCJ-substitution models by assigning distinct weights to the content-modifying and DCJ operations. For these weighted genomic operations, we obtained exact distance formulas, which are computable for any choice of weights.

 
Genome representation: each number represents a gene or marker, and the sign indicates the polarity or orientation of the gene.

When comparing three or more genomes, one can use the median concept. The median problem has been extensively used in the small phylogeny problem for various genome distances. One goal is to obtain more common information from the given genomes in order to estimate the true ancestor.  We investigated (da Silva et al. 2017, 2018) the breakpoint median problems for a set of random genomes.
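To make the signed-gene representation and the breakpoint distance concrete, here is a minimal sketch. It assumes circular unichromosomal genomes with identical gene content, for which the breakpoint distance is the number of genes minus the number of shared adjacencies; the function names are hypothetical and this is not the implementation used in the cited papers.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene as it is read.

    A gene +g is read tail -> head, a gene -g is read head -> tail.
    """
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacencies(genome):
    """Set of adjacencies of a circular genome given as a list of signed integers."""
    n = len(genome)
    adj = set()
    for i in range(n):
        _, right = extremities(genome[i])
        left, _ = extremities(genome[(i + 1) % n])  # wrap around: circular genome
        adj.add(frozenset((right, left)))
    return adj


def breakpoint_distance(a, b):
    """Breakpoint distance between two circular genomes with the same genes."""
    return len(a) - len(adjacencies(a) & adjacencies(b))


def median_score(candidate, genomes):
    """Sum of breakpoint distances from a candidate median to the given genomes."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)
```

For example, inverting the segment 2 3 of the genome 1 2 3 4 gives 1 -3 -2 4, and `breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4])` counts the two adjacencies broken by the inversion. A breakpoint median is then a genome minimizing `median_score` over the input genomes.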


Another topic of interest is gene clustering. Methods for the clustering of genes into homologous families (sets of genes descending from a single gene in an ancestral organism) are susceptible to the inappropriate merging of unrelated families, called domain chaining. In (da Silva et al. 2015), we provided formal criteria for the chaining effect by defining multiple alternative clique relaxation and path relaxation models and the relationships among them. We implemented these definitions and applied them to 45 flowering plant genomes in order to compare two clustering methods.

Comparing two clustering methods (image from da Silva et al. 2015).

Ongoing projects:


Evolutionary Dynamics in Structured Populations

The classical Moran process can be seen as an interacting population dynamics on a complete graph, in which any pair of individuals can interact with each other. When the interactions are restricted to certain pairs of individuals, this can be generalized to a spatial model in which interaction can only occur between neighbors in a given graph. Instead of constant or linear fitness functions, which are quite common in the evolutionary game theory literature, I am interested in versions of the birth-death (BD) and death-birth (DB) processes (also called generalized Moran processes) in which selection is governed by very general smooth frequency-dependent fitness functions under the weak-selection regime. Of course, this makes the model much more complicated. For a large population structured as a star graph, we provided approximations for the fixation probability, which are solutions of certain systems of ODEs (da Silva & Souza 2022).
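As an illustration of the BD process on a star graph, the following is a minimal Monte Carlo sketch. It assumes that fitness depends only on the global mutant frequency x, passed in as two hypothetical callables, and that a single mutant starts at a uniformly chosen vertex; it estimates the fixation probability by brute-force simulation and is not the ODE approximation of the cited paper.

```python
import random


def star_graph(n_leaves):
    """Adjacency lists of a star: vertex 0 is the center, 1..n_leaves are leaves."""
    nbrs = {0: list(range(1, n_leaves + 1))}
    for leaf in range(1, n_leaves + 1):
        nbrs[leaf] = [0]
    return nbrs


def bd_fixation(nbrs, start, mutant_fitness, resident_fitness, rng):
    """Run one BD trajectory; return True if the mutants fixate."""
    nodes = list(nbrs)
    n = len(nodes)
    mutants = {start}
    while 0 < len(mutants) < n:
        x = len(mutants) / n  # current mutant frequency
        fm, fr = mutant_fitness(x), resident_fitness(x)
        # Birth: pick a parent with probability proportional to fitness.
        weights = [fm if v in mutants else fr for v in nodes]
        parent = rng.choices(nodes, weights=weights)[0]
        # Death: the offspring replaces a uniformly chosen neighbor.
        child = rng.choice(nbrs[parent])
        if parent in mutants:
            mutants.add(child)
        else:
            mutants.discard(child)
    return len(mutants) == n


def estimate_fixation(nbrs, mutant_fitness, resident_fitness, runs, seed=0):
    """Fraction of runs in which a single, uniformly placed mutant fixates."""
    rng = random.Random(seed)
    nodes = list(nbrs)
    hits = sum(
        bd_fixation(nbrs, rng.choice(nodes), mutant_fitness, resident_fitness, rng)
        for _ in range(runs)
    )
    return hits / runs
```

A weak-selection experiment could, for instance, call `estimate_fixation(star_graph(20), lambda x: 1 + 0.01 * (1 - x), lambda x: 1.0, runs=10000)`, where the linear fitness perturbation is chosen purely for illustration.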

As an extension of the star graph, we introduced a class of graphs having a star-like structure, for which we can compute the fixation probability by reducing the analysis of the dynamics of the BD and the DB processes from the general space of configurations to a much smaller space. In this case, the approximate fixation probability is identified as the unique solution to a system of PDEs. These star-like graphs include most of the previous prototypical families studied in the literature.

Ongoing projects: 

Clonal Branching Models

I study the evolution of family size counts for some specific generalizations of the Birth-Death-Immigration (BDI) process, in which the birth, death and immigration rates depend on the clone and on time. We consider different versions of these clonal BDI processes and estimate their population parameters using various inference methods. We make use of the Poisson marking theorem as a very helpful tool for computing different quantities in these models.
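The family size counts mentioned above can be illustrated with a minimal Gillespie-style sketch of a linear BDI process. It assumes constant rates (the clone- and time-dependent rates of the generalized models are not covered) and that each immigrant founds a new family; all names are hypothetical.

```python
import random
from collections import Counter


def simulate_bdi(birth, death, immigration, t_end, rng):
    """Simulate a linear BDI process up to time t_end.

    Each individual gives birth at rate `birth` and dies at rate `death`;
    immigrants arrive at rate `immigration`, each founding a new family.
    Returns the list of sizes of the families alive at time t_end.
    """
    families = []
    t = 0.0
    while True:
        population = sum(families)
        total_rate = immigration + (birth + death) * population
        if total_rate <= 0:
            break  # nothing can happen any more
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        u = rng.random() * total_rate
        if u < immigration:
            families.append(1)  # a new family of size 1
            continue
        # Birth or death of a uniformly chosen individual: its family is
        # selected with probability proportional to the family size.
        idx = rng.choices(range(len(families)), weights=families)[0]
        if u < immigration + birth * population:
            families[idx] += 1
        else:
            families[idx] -= 1
            if families[idx] == 0:
                families.pop(idx)  # the family has gone extinct
    return families


def family_size_counts(families):
    """Map each family size j to the number of families of size j."""
    return Counter(families)
```

Calling `family_size_counts(simulate_bdi(1.0, 1.2, 2.0, 3.0, random.Random(0)))` returns the counts-of-counts vector at time 3 for one simulated trajectory.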

In another work (da Silva et al. 2022b), we explored properties of an evolving model of counts-of-counts data that arise as the family size counts of samples taken sequentially from a Birth-Immigration (BI) process. We studied the correlation of the number of families observed in disjoint time intervals, and found the expected sample variance and its asymptotics for these consecutive sequential samples.

The BDI and/or BI processes can be applied to predict the appearance of new species in a population or in a given biological data set. As an example, we used the BI process as a non-mechanistic continuous-time model to predict the appearance of COVID-19 variants and sub-variants in test centers; more precisely, we modeled the arrival of COVID-19 genome sequences at the GISAID database. Under different conditions, we obtained the maximum likelihood estimates for the weekly arrival rates of existing and new variants and sub-variants. This is joint work with A. Jamshidpey and S. Tavaré.

Ongoing projects: 

Combinatorial Structures and Sampling Theory

 

The Ewens Sampling Formula ESF_n(θ) determines the joint distribution of the number of alleles repeated j times in a sample of size n taken from a very large population of selectively neutral alleles. Here θ stands for the rate at which novel alleles appear in the sample. Equivalently, ESF_n(θ) gives the distribution of the cycle counts of a θ-biased permutation. Singletons are often considered unreliable, and hence ignored in sampling theory, since they may appear as a result of errors in identifying new alleles. One approach is to reject the samples with singletons. This is in fact equivalent to sampling from ESF_n(θ), conditional on no 1-cycle. Motivated by this, we studied (da Silva et al. 2022a) θ-biased random derangements and constructed finite and infinite {0,1}-valued non-homogeneous Markov chains that generate the ESF distribution and its approximation, conditional on no fixed points (singletons). Both chains are very useful for fast simulations of derangements.
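For comparison, the naive rejection approach described above can be sketched as follows. The sketch uses the standard Feller coupling, in which independent Bernoulli variables with success probabilities θ/(θ + i − 1), i = 1, ..., n, are followed by a final 1, and the number of cycles of length j is the number of spacings of length j between consecutive 1s. It then simply resamples until no spacing of length 1 (a 1-cycle) occurs. This is a baseline under these assumptions, not the Markov chain construction of the cited paper, and the function names are hypothetical.

```python
import random
from collections import Counter


def feller_cycle_counts(n, theta, rng):
    """Sample cycle counts of a theta-biased permutation of size n.

    Returns a Counter mapping cycle length j to the number of j-cycles,
    distributed according to the Ewens Sampling Formula.
    """
    # xi_i = 1 with probability theta / (theta + i - 1); xi_1 is always 1.
    xi = [1 if rng.random() < theta / (theta + i - 1) else 0 for i in range(1, n + 1)]
    xi.append(1)  # closing 1 so that the last spacing is counted
    ones = [k for k, bit in enumerate(xi) if bit == 1]
    spacings = [b - a for a, b in zip(ones, ones[1:])]
    return Counter(spacings)


def sample_without_singletons(n, theta, rng, max_tries=100000):
    """Rejection sampling: resample until there is no 1-cycle (singleton)."""
    if n < 2:
        raise ValueError("a derangement needs n >= 2")
    for _ in range(max_tries):
        counts = feller_cycle_counts(n, theta, rng)
        if counts[1] == 0:
            return counts
    raise RuntimeError("no singleton-free sample found within max_tries")
```

Since the acceptance probability shrinks as θ grows, rejection becomes wasteful for large θ, which is one reason a direct chain conditioned on no fixed points is attractive for simulation.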

Later on, we extended these results (da Silva et al. 2022c) to a more general class of biased random derangements, where instead of a constant θ, the rate at which a new allele appears in the sample at step i is given by θ_i. A {0,1}-valued Markov chain X_n records the cycle type of the corresponding random derangement, in which each 1 represents the appearance of a new allele in the sample. We also established conditional and push-forward relations between X_n and a generalization of the Feller coupling, given that no 11-pattern (1-cycle) appears in the latter.

We also studied (da Silva et al. 2023) the correlation of a collection of samples drawn sequentially from the ESF. We provided a model that may explain the sample variance formulas given in Fisher's 1943 paper. We found the expected sample variance of the number of species observed in these samples and established a commutative limit diagram describing the asymptotic behavior of the sample variance when the number and/or the sizes of the samples tend to infinity. We proved that most of the species are observed in all samples, and obtained the log-series distribution as the limit of the distributions of the number of specimens found in the new species in a future sample.


Ongoing projects: