My primary interests are in computational and systems biology and applied machine learning. I develop and apply statistics and machine learning methods to extract useful information from noisy, high-dimensional, high-throughput biological data.
Some of the projects that I have worked on are listed below:
The ENCODE (ENCyclopedia Of DNA Elements) project was launched by the NHGRI to identify all functional elements in the human genome using next-generation sequencing techniques and other high-throughput assays. Computational algorithms and learning methods are required to separate signal from noise and to learn higher-order relationships between various functional sites. As part of my postdoc at Stanford University, I worked with the ENCODE consortium to develop algorithms for the integrative analysis of ENCODE data. Specific topics include:
In my PhD thesis, I developed a new predictive modeling framework for studying gene regulation. We formulate the problem of learning regulatory programs as a binary classification task: to accurately predict the condition-specific activation (up-regulation) and repression (down-regulation) of gene expression. The gene expression response is measured by microarray expression data. Genes are represented by various genomic regulatory sequence features. Experimental conditions are represented by the gene expression levels of various regulatory proteins. We use this combination of features to learn a prediction function for the regulatory response of genes under different experimental conditions. The core computational approach is based on boosting. Boosting algorithms allow us to learn high-accuracy, large-margin classifiers and help avoid overfitting.
In the GeneClass algorithm, we use a compendium of known transcription factor binding sites and gene expression data to learn a global context-specific regulation program that accurately predicts differential expression. GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. We introduce a novel robust variant of boosting that improves stability and biological interpretability in the presence of correlated features. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the framework.
In computational experiments based on yeast environmental stress response and DNA damage datasets, we show that GeneClass predicts up- and down-regulation on held-out experiments with high accuracy. We explore a range of experimental setups related to environmental stress response, and we retrieve important regulators, binding site motifs, and relationships between regulators and binding sites that are known to be associated with specific stress response pathways.
We present a postprocessing framework for biological interpretation, including gene and gene set analysis to reveal condition-specific regulatory programs and to suggest signaling pathways.
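The boosting setup described above can be illustrated with a minimal sketch. It assumes binary motif-presence features per gene and binary "regulator is up" features per experiment, pairs them into (motif, regulator) features, and runs plain discrete AdaBoost with one-feature rules. This is not the GeneClass implementation: GeneClass learns an alternating decision tree and uses a robust boosting variant, neither of which is reproduced here. The function names and the toy data are hypothetical.

```python
import math

def pair_features(motif_hits, regulator_up):
    """One binary feature per (motif, regulator) pair: on only when the motif
    is in the gene's promoter AND the regulator is up in the experiment."""
    return [m & r for m in motif_hits for r in regulator_up]

def stump(x, j, sign):
    """Weak rule: predict `sign` when feature j is on, `-sign` otherwise."""
    return sign if x[j] else -sign

def adaboost(X, y, rounds):
    """Plain discrete AdaBoost (an assumption; not the robust variant).
    Returns a list of (feature_index, sign, alpha) weak rules."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n  # start with uniform example weights
    model = []
    for _ in range(rounds):
        # Pick the rule with the lowest weighted training error.
        best = None
        for j in range(d):
            for sign in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, j, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, sign)
        err, j, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, sign, alpha))
        # Up-weight the examples this rule got wrong, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(xi, j, sign))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def margin_score(model, x):
    """Signed, weighted vote of all rules; its sign is the prediction."""
    return sum(alpha * stump(x, j, sign) for j, sign, alpha in model)

# Toy data (hypothetical): a gene is up-regulated (+1) only when motif 0 is
# in its promoter and regulator 0 is up in the experiment; otherwise -1.
X, y = [], []
for m0 in (0, 1):
    for m1 in (0, 1):
        for r0 in (0, 1):
            for r1 in (0, 1):
                X.append(pair_features([m0, m1], [r0, r1]))
                y.append(1 if (m0 and r0) else -1)

model = adaboost(X, y, rounds=3)
predictions = [1 if margin_score(model, x) > 0 else -1 for x in X]
```

In this toy case the first rule selected is the (motif 0, regulator 0) pair, which is the kind of regulator-to-binding-site relationship the interpretation step is meant to surface.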
In our Module-clust algorithm we present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules, that is, sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. In this way, we make use of genome-wide motif data that is now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes with similar time series expression profiles.
The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course datasets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
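One way to picture the motif-ranking step is to split the standard Jensen-Shannon divergence between a module's motif distribution and a background distribution into one term per motif, then rank motifs by their term. The sketch below does that under stated assumptions: the exact quantity used in Module-clust is not specified here, and the motif names and frequencies are made up for illustration.

```python
import math

def js_contributions(p, q):
    """Per-item terms of the Jensen-Shannon divergence (in bits) between two
    discrete distributions p and q over the same items. The terms sum to the
    full divergence, and each term is non-negative."""
    terms = []
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)  # mixture distribution at this item
        term = 0.0
        if pi > 0:
            term += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            term += 0.5 * qi * math.log2(qi / mi)
        terms.append(term)
    return terms

# Hypothetical motif frequencies within one module versus genome-wide.
motifs = ["motif_A", "motif_B", "motif_C"]
module_freq = [0.7, 0.2, 0.1]
background_freq = [0.2, 0.4, 0.4]

terms = js_contributions(module_freq, background_freq)
# Motifs whose frequency in the module departs most from background rank first.
ranked = sorted(zip(motifs, terms), key=lambda t: t[1], reverse=True)
```

A large term can reflect either enrichment or depletion in the module, so a practical ranking would also check the direction of the difference.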
transcription factors are unknown. Hence, automatic discovery of regulatory sequence motifs is required.
In the MEDUSA algorithm, we integrate raw promoter sequence data and gene expression data to simultaneously discover cis-regulatory motifs ab initio and learn predictive regulatory programs. MEDUSA automatically learns probabilistic representations of motifs and their corresponding target genes. We apply MEDUSA to various datasets of different sizes in yeast, worm and human B cells. We learn yeast motifs whose ability to predict differential expression of target genes outperforms motifs from a compendium of known binding sites and from a previously published candidate set of learned motifs. We also show that MEDUSA retrieves many experimentally confirmed transcription factor binding sites. We introduce a novel margin-based score to extract significant context-specific regulators and motifs.
We present a specific case study where our collaborators validate some of our regulatory hypotheses using biochemical experiments. We use MEDUSA to study the oxygen regulatory network in yeast (S. cerevisiae), using a small data set of perturbation experiments that probe the response of yeast to hypoxia (low oxygen levels). We assemble a global map of the oxygen sensing and regulatory network. We also identify many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Our collaborators directly test a set of regulators predicted by MEDUSA for the OLE1 gene, which is specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator results in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation.
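As a rough illustration of discovering motifs from raw promoter sequence, the sketch below enumerates every k-mer in a set of promoters and picks the one whose presence best agrees with weighted up/down labels, which is the kind of choice a single boosting round makes. It is a simplification under stated assumptions: it handles one experimental condition, uses exact k-mers where MEDUSA learns probabilistic motif representations, and leaves out the pairing with regulators. The gene names, sequences, and function names are hypothetical.

```python
from collections import defaultdict

def kmer_presence(promoters, k):
    """Map each k-mer to the set of genes whose promoter contains it."""
    index = defaultdict(set)
    for gene, seq in promoters.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(gene)
    return index

def best_kmer(promoters, labels, weights, k):
    """Return the k-mer whose presence best agrees with the weighted labels
    (+1 up-regulated, -1 down-regulated), along with the k-mer index."""
    index = kmer_presence(promoters, k)

    def edge(kmer):
        # Weighted label sum over genes carrying the k-mer; a large absolute
        # value means the k-mer is predictive of up- or down-regulation.
        return abs(sum(weights[g] * labels[g] for g in index[kmer]))

    # Sort first so that ties are broken alphabetically and reproducibly.
    return max(sorted(index), key=edge), index

# Toy promoters (hypothetical): both up-regulated genes share "ACGT".
promoters = {
    "geneA": "TTACGTAA",
    "geneB": "GGACGTCC",
    "geneC": "TTTTGGCC",
    "geneD": "GGCATTAA",
}
labels = {"geneA": 1, "geneB": 1, "geneC": -1, "geneD": -1}
weights = {gene: 0.25 for gene in promoters}  # uniform example weights

candidate, index = best_kmer(promoters, labels, weights, k=4)
```

In a boosting loop the example weights would change after every round, so later rounds would favor k-mers that explain the genes earlier rules misclassified.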