Our research centers on machine learning motivated by computational biology and biomedical research. The figure below shows gene regulation, and the research topics summarized below each focus on a different aspect of it, highlighted in grey. Note, however, that many of these topics are multidisciplinary, made possible by the combined strengths of statistical signal processing/machine learning and biomedical research. The categories below are drawn for convenience, and some research results are difficult to assign to a single category.
Consider the problem of predicting symptom severity from gene expression, where the dimension of the gene expression data is much larger than the number of samples. Dimension reduction methods that incorporate both biomarker screening and prediction are useful in such problems. Partial least squares (PLS) regression is a supervised dimension reduction method that incorporates prediction into dimension reduction. Because it requires neither matrix inversion nor diagonalization, it has been applied successfully to problems with high predictor dimension. However, as the number of predictor variables increases, PLS can suffer from over-fitting: prediction performance degrades and the parameters become difficult to interpret. A global variable selection approach was proposed, which penalizes the total number of variables across all PLS components. Results showed that the proposed formulation reduced model complexity by selecting far fewer predictor variables, while achieving good prediction ability.
Consider the problem of designing a panel of complex biomarkers to predict a patient's health or disease state when one can pair the current test sample, called a target sample, with the patient's previously acquired healthy sample, called a reference sample. In contrast to a population-averaged reference, this reference sample is individualized. I introduced a sparsity-penalized multi-class classifier design that accounts for the multi-block structure of the data, which arises naturally in serially sampled data or spatially diversified sampling experiments. The classifier was trained to minimize an objective function that captures the unified misclassification probabilities over the classes together with the sparsity of the weights. Results showed that the disease prediction rate improved and that the method was able to control for irrelevant patient variation.
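The idea of an individualized reference can be illustrated with a small sketch: subtracting each patient's reference sample removes patient-specific baselines, and an L1 penalty serves as a generic stand-in for the sparsity term (this is not the paper's unified objective, just a simplified analogue).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical paired data: for each patient, a healthy "reference" sample
# and a current "target" sample over the same biomarker panel.
n_patients, n_markers, n_classes = 90, 60, 3
reference = rng.standard_normal((n_patients, n_markers))
labels = rng.integers(0, n_classes, n_patients)
shift = np.zeros((n_classes, n_markers))
shift[1, :5] = 2.0      # disease class 1 perturbs markers 0-4
shift[2, 5:10] = 2.0    # disease class 2 perturbs markers 5-9
target = reference + shift[labels] + 0.3 * rng.standard_normal((n_patients, n_markers))

# Individualized contrast: the patient-specific baseline cancels out.
X = target - reference

# Sparse multi-class classifier (L1 penalty as the sparsity surrogate).
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, labels)
accuracy = clf.score(X, labels)
n_used = int((np.abs(clf.coef_) > 1e-6).any(axis=0).sum())
print(accuracy, n_used)
```

Because the reference subtraction removes each patient's idiosyncratic baseline, the classifier only has to separate the disease-induced shifts, which is what controls for irrelevant patient variation.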
Translational Dynamics - Prediction of Ribosome Densities
Ribosomes are not uniformly distributed along transcripts. Understanding how this transcript-specific distribution arises, and to what extent it depends on the sequence content, is fundamental to unraveling the translation mechanism. Motivated by the ribosome footprint profiles reported in the literature, which are far from uniform, and by the competing hypotheses explaining the underlying mechanism, I focus here on predicting the marginal densities of ribosome footprints using the sequence context alone. This is an interesting machine learning problem, in which the predictors are categorical and the response variables are continuous. The ability to predict marginal densities from sequence content alone has many potential applications, including isoform-specific ribosome inference and the design of transcripts with fast translation.
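The categorical-predictor/continuous-response structure can be sketched as follows, under the simplifying (and hypothetical) assumption that each codon has an intrinsic dwell time and the observed footprint density is that dwell time plus noise; one-hot encoding turns the categorical codon identity into regression features.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# The 64 possible codons: the categorical predictor alphabet.
codons = [a + b + c for a in "ACGU" for b in "ACGU" for c in "ACGU"]
codon_index = {c: i for i, c in enumerate(codons)}

# Assumed generative model: each codon has an unknown dwell time, and the
# observed per-position ribosome density is that dwell time plus noise.
true_dwell = rng.gamma(2.0, 1.0, size=64)
seq = rng.choice(codons, size=2000)
idx = np.array([codon_index[c] for c in seq])
density = true_dwell[idx] + 0.2 * rng.standard_normal(2000)

# One-hot encode the categorical codon identity.
X = np.zeros((len(seq), 64))
X[np.arange(len(seq)), idx] = 1.0

# Continuous-response regression on categorical predictors.
model = Ridge(alpha=1.0).fit(X, density)
r2 = model.score(X, density)
print(round(r2, 3))
```

A realistic model would use a window of codon context around each position rather than a single codon, but the encoding pattern is the same.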
Integrative Longitudinal Analysis of Ribosome Occupancy and Protein Synthesis
The regulation of gene expression comprises transcription and translation. During translation, ribosomes traverse each codon of an mRNA transcript to synthesize proteins according to the message encoded in the transcript. Although transcription has been studied extensively thanks to advances in microarray and deep sequencing technologies, studies of translational dynamics remained challenging until the development of ribosome profiling. Ribosome profiling provides a snapshot of the distribution of ribosomes along transcripts and enables quantitative monitoring and analysis of the translational process. In this project, a collaboration with Dr. Arun Wiita's lab at UCSF, I developed functional data analysis methods that jointly analyze mRNA-seq, ribosome profiling, and pulse-chase isotopic-labeling mass spectrometry-based proteomics. Our work offers a novel quantitative framework for understanding translation using a combination of emerging technologies. Combining this model with concurrent biochemical and genetic experimentation may allow us to identify the factors that govern translational regulation in cancer, and potentially in eukaryotes more broadly, and shed light on targeted therapies.
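One simple way such modalities can be linked quantitatively is through a synthesis/degradation model, in which ribosome occupancy drives protein synthesis and labeled protein accumulates during the pulse. The sketch below fits such a model to simulated pulse-chase measurements; this specific first-order model and grid-search fit are illustrative assumptions, not the project's actual functional data analysis method.

```python
import numpy as np

# Minimal synthesis/degradation model for pulse-chase labeling:
#   dP/dt = k_s * R - k_d * P,  P(0) = 0
# where R is an (assumed constant) ribosome occupancy, k_s a synthesis
# rate, k_d a degradation rate, and P the labeled protein abundance.
# Closed form: P(t) = (k_s * R / k_d) * (1 - exp(-k_d * t)).

def labeled_protein(t, k_s, k_d, R):
    return (k_s * R / k_d) * (1.0 - np.exp(-k_d * t))

rng = np.random.default_rng(3)
t = np.linspace(0.5, 8.0, 12)   # pulse-chase time points (hours)
R = 2.0                          # occupancy from ribosome profiling
k_s_true, k_d_true = 1.5, 0.4
obs = labeled_protein(t, k_s_true, k_d_true, R) * (1 + 0.02 * rng.standard_normal(12))

# Grid search for the rate constants that best fit the proteomics data.
grid = np.linspace(0.05, 3.0, 300)
best = min(
    ((ks, kd) for ks in grid for kd in grid),
    key=lambda p: np.sum((labeled_protein(t, p[0], p[1], R) - obs) ** 2),
)
print(best)
```

The recovered rates stay close to the true values because the plateau of the curve pins down k_s*R/k_d while the early curvature pins down k_d.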
Automated Analysis of Heterochromatin Dynamics
The genomes and physiologies of the many cells that make up a growing microbial colony can undergo complex changes. Genetic and physiological dynamics can be revealed by measuring reporter-gene expression, but rigorous quantitative analysis of colony-wide patterns has been under-explored. In this collaboration with Dr. Jasper Rine's lab at UCB, I developed a suite of automated image processing, visualization, and classification algorithms (Morphological Phenotype Extraction: MORPHE) that facilitated the analysis of heterochromatin dynamics in the context of colonial growth and that can be broadly adapted to many colony-based assays in Saccharomyces and other microbes. Using the features automatically extracted from fluorescence images, MORPHE revealed subtle but significant differences in the stability of heterochromatic repression that were not apparent by visual inspection.
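The basic segment-then-extract-features pattern underlying such a pipeline can be sketched on a synthetic fluorescence image (the thresholds and feature choices here are illustrative, not MORPHE's actual algorithms):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(4)

# Synthetic "fluorescence image": dim background plus two bright colonies.
img = 0.05 * rng.random((100, 100))
yy, xx = np.mgrid[:100, :100]
img[(yy - 30) ** 2 + (xx - 30) ** 2 < 80] = 0.9    # colony 1
img[(yy - 70) ** 2 + (xx - 65) ** 2 < 150] = 0.7   # colony 2

# Segment by intensity thresholding, then label connected components.
mask = img > 0.3
labels, n = ndimage.label(mask)

# Extract simple per-colony features (area, mean intensity) that a
# downstream classifier could consume.
features = [
    (int((labels == i).sum()), float(img[labels == i].mean()))
    for i in range(1, n + 1)
]
print(n, features)
```

In a real colony assay, the feature vector would be richer (shape, sectoring, radial intensity profiles), but the segmentation-to-features flow is the same.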
Automated Image Segmentation and Feature Extraction with Applications to Cell Deformation, Heterochromatin Dynamics
Modern developments in light microscopy allow the observation of cell deformation with remarkable spatiotemporal resolution and reproducibility. Owing to the considerable complexity of cell deformation and migration, visual analysis of such processes is not only limited by user bias, but also fails to capture large-scale, population-wide patterns that may otherwise appear random or disorganized. Systematic quantitative analysis and understanding of such phenomena are therefore becoming a major interest for the signal processing and computer vision communities. A combination of shape description, i.e., spherical harmonics analysis, and machine-learning techniques was proposed to analyze amoeboid cell spatiotemporal deformation, recorded as time-lapse sequences of volumetric 3D images.
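Spherical harmonics play for 3D surfaces the role that Fourier descriptors play for 2D contours: an orthogonal expansion whose magnitude spectrum is invariant to rotation. The sketch below shows the 2D analogue on a synthetic "amoeboid" contour, as a stand-in for the full spherical harmonics machinery.

```python
import numpy as np

# Fourier descriptors of a closed 2D contour: the planar analogue of the
# spherical harmonics expansion used for 3D cell surfaces. The magnitude
# spectrum is invariant to rotation of the shape.

def fourier_descriptors(x, y, n_keep=8):
    z = x + 1j * y                 # contour as a complex signal
    mags = np.abs(np.fft.fft(z))
    # Normalize by |c_1| for scale invariance; drop c_0 (translation).
    return mags[1 : n_keep + 1] / mags[1]

theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
# A deformed ("amoeboid") contour: a circle plus a 3-lobed bump.
r = 1.0 + 0.3 * np.cos(3 * theta)
x, y = r * np.cos(theta), r * np.sin(theta)
d1 = fourier_descriptors(x, y)

# The same shape rotated by 40 degrees yields the same descriptors,
# since rotation only multiplies every coefficient by a unit phase.
phi = np.deg2rad(40)
xr = x * np.cos(phi) - y * np.sin(phi)
yr = x * np.sin(phi) + y * np.cos(phi)
d2 = fourier_descriptors(xr, yr)
print(np.allclose(d1, d2))
```

Such rotation-invariant spectra, computed per time point, give a fixed-length feature vector that machine-learning methods can track through a deformation time-lapse.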
Automated Analysis of Electrocardiograms to Identify the Origin of Arrhythmia
Ventricular tachycardia (VT) is a potentially life-threatening arrhythmia that can lead to ventricular fibrillation and sudden death. Detecting and localizing VT are therefore important problems in electrocardiology. The data consist of high-dimensional time series with high variability. I developed algorithms that use single-lead electrograms as a surrogate for 12-lead electrocardiograms. Automated classification or prediction of the origin of VT from electrocardiograms can shorten the pace-mapping procedure, which usually takes more than 6 hours.
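The classification task can be sketched with simulated single-lead beats: each pacing site is assumed to produce a characteristic waveform template, and a nearest-neighbor classifier matches a new beat to the known sites (the templates and classifier here are hypothetical illustrations, not the developed algorithms).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)

# Hypothetical single-lead electrogram templates for three pacing sites:
# each site yields a characteristic waveform shape and timing.
t = np.linspace(0, 1, 200)
templates = [
    np.exp(-((t - 0.3) ** 2) / 0.002),                                      # site 0
    np.exp(-((t - 0.5) ** 2) / 0.002),                                      # site 1
    np.exp(-((t - 0.5) ** 2) / 0.002) - np.exp(-((t - 0.35) ** 2) / 0.004), # site 2
]

def simulate(site, n):
    # Observed beats: noisy, slightly amplitude-varied copies of a template.
    amp = 1 + 0.1 * rng.standard_normal((n, 1))
    return amp * templates[site] + 0.05 * rng.standard_normal((n, 200))

X = np.vstack([simulate(s, 30) for s in range(3)])
y = np.repeat([0, 1, 2], 30)

# Match new beats to the pace-mapped library by waveform similarity.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
score = clf.score(simulate(2, 10), np.full(10, 2))
print(score)
```

Matching an induced VT beat against a library of pace-mapped beats in this way is what lets automation replace hours of manual trial-and-error pacing.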