Research

An artwork illustrating our use of feature selection and feature aggregation, two machine learning techniques, to integrate multi-scale omics data for genotype-phenotype association analysis (Cao et al., Genetics 2022, February cover feature). This is an example of our work in Research Theme 1.

Research Theme 1: We integrate machine learning and statistical techniques with in-between-omes (e.g., the epigenome and transcriptome) and brain images to bridge the gap between genotype and phenotype. A mismatch between the "big data" in -omics and traditional machine learning methods is that we have few samples but very many dimensions. So the "big" aspect of genomic data is what computer scientists would actually call noise.

With this in mind, a central theme of our research is how to integrate biological insights with novel algorithms from computer science, such as transfer learning and representation learning. Together with local and international collaborators, we apply our methods to two domains: (1) brain diseases, including neurodevelopmental and neurodegenerative disorders; and (2) cancers.

Artificial Intelligence (left) for the characterization of brain diseases (right) 

Research Theme 2: Biostatistical and bioinformatic analyses face several challenges: (1) how to nimbly analyze large -omics datasets using lightweight infrastructure; (2) how to infer underlying structure from limited observations; (3) how to accurately capture meaningful signals despite inherent noise and bias in the raw data.

Based on our experience in data analysis and collaborations, we develop bioinformatic and biostatistical tools to handle current problems encountered by researchers. So far, we have developed several toolkits focusing on (1) haplotype inference from pooled sequencing, (2) out-of-core tools (which store data on disk but access them as though they resided in memory) for statistical genetics, and (3) fast and robust tools for population genetics and evolutionary analysis.
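The out-of-core idea in item (2) can be illustrated with a minimal Python sketch. This is an illustration only and not the implementation used in any of the toolkits mentioned on this page: the raw-byte file layout (one byte per genotype, one row per sample) and the function names `write_genotypes` and `allele_frequencies` are assumptions made for this example. The file is memory-mapped with the standard-library `mmap` module, so the operating system pages data in from disk on demand and only one row is materialized at a time.

```python
import mmap
import os
import random
import tempfile


def write_genotypes(path, n_samples, n_variants, seed=0):
    """Write a toy genotype matrix (values 0, 1, 2) as raw bytes, one row per sample."""
    rng = random.Random(seed)
    rows = []
    with open(path, "wb") as fh:
        for _ in range(n_samples):
            row = bytes(rng.randrange(3) for _ in range(n_variants))
            fh.write(row)
            rows.append(row)
    return rows  # returned only so a demo can cross-check the result


def allele_frequencies(path, n_samples, n_variants):
    """Estimate the alternate-allele frequency of each variant from the on-disk matrix."""
    counts = [0] * n_variants
    with open(path, "rb") as fh:
        # The mapping is indexed like a bytes object; the OS loads pages lazily.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(n_samples):
                # Slice out a single sample row; the rest of the file stays on disk.
                row = mm[i * n_variants:(i + 1) * n_variants]
                for j, genotype in enumerate(row):
                    counts[j] += genotype
    # Assumes diploid samples: two alleles per sample at each variant.
    return [c / (2 * n_samples) for c in counts]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "genotypes.bin")
        write_genotypes(path, n_samples=100, n_variants=8)
        print(allele_frequencies(path, n_samples=100, n_variants=8))
```

The calling code treats the mapped file as if it were an in-memory buffer, which is the property the parenthetical above describes. A production tool would typically use a packed binary format and vectorized or GPU kernels rather than a pure-Python loop.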

An artwork for PoolHapX, a tool that infers haplotypes from pooled sequencing to support evolutionary analysis (Cao et al., MBE 2021). This is an example of our work in Research Theme 2.

An artwork for CATE, a tool that handles huge genomic datasets using CUDA-accelerated, highly parallel, out-of-core algorithms (Perera et al., MEE 2023). This is an example of our work in Research Theme 2.

Research Theme 3: We are interested in theoretical questions in machine learning, inspired by our observations in real data analysis and the latest publications in the ML field. In particular, we focus on the behaviour of machine learning models when domain knowledge is integrated and/or transfer learning is applied. We are also trying to characterize when and how nonlinear biological systems may be approximated by linear models.



We are also open to new research themes that align with our resources and expertise. Our publications and their citations can be found in our Google Scholar Citation Report. Several representative works and our latest manuscripts are listed here.

Funding Support