SAGAconf

I have been working with PhD student Mehdi Foroozandeh in the Libbrecht Lab at Simon Fraser University on genomic modelling with machine learning, first over summer 2023 and again since spring 2024.

Building on a novel "reproducibility" metric that he introduced, I am investigating various hyperparameters and other factors, with the goal of developing guidelines and best practices for the practical deployment of state-of-the-art Segmentation And Genome Annotation algorithms. Concretely, I am running the evaluation experiments for these investigations and building the infrastructure that supports them.


But first, what are Segmentation And Genome Annotation algorithms (SAGAs), and why should anyone care?

Our ability to sequence--or read--DNA and RNA has been improving rapidly, outpacing even Moore's Law. To keep up with this growing stream of data, we need algorithms that help us interpret it and use it to understand the "language" of the genome, our DNA.

Specifically, genomes are composed of many functional elements. Each of these sections of our DNA can be assigned to a functional class, from gene bodies that serve as "recipes" encoding the contents of the proteins that run our lives, to promoters, enhancers and repressors that dynamically regulate the transcription of genes--the first step in "building" proteins--in response to environmental variables.

Getting a picture of the functional elements that comprise genomes is important for human, plant and animal health, as well as advancing our understanding of gene regulation in general. So, how can we do so? 

It turns out that cells maintain a system of chemical "tags" on DNA and on the histone proteins that package it, collectively called the epigenome, and that each class of functional element tends to carry a characteristic epigenetic "profile". Laboratory techniques like ChIP-seq have been developed to characterize the epigenome. Segmentation And Genome Annotation algorithms (SAGAs) take the data from these techniques and segment a genome, predicting the functional elements it is composed of.

Note: For a proper introduction to SAGAs, please read the review "Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns".

Epigenetics: histone modifications and gene expression.
National Institutes of Health, Public domain, via Wikimedia Commons

The role of enhancer, promoter, and transcription factor genomic elements in gene regulation.
Source: Mattaini, Katherine. “Chapter 17. Regulation of Gene Expression.” Rwu.pressbooks.pub, 27 July 2020, rwu.pressbooks.pub/bio103/chapter/regulation-of-gene-expression/.

Figure 1, Overview of genome segmentation and annotation algorithms. Source:

Libbrecht, M. W., Chan, R. C. W. & Hoffman, M. M. Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns. PLOS Computational Biology 17, e1009423 (2021).

Factors I am investigating:

Resolution (bin size)

Width of bins to aggregate data into
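To make the idea of resolution concrete, here is a minimal sketch of binning, assuming the input is a per-base signal track held in a NumPy array and that each bin is summarized by its mean. The function name `bin_signal` and the toy values are hypothetical and only for illustration; real pipelines may aggregate differently.

```python
import numpy as np

def bin_signal(signal, bin_size):
    """Average a per-base signal track into fixed-width bins.

    A trailing partial bin is averaged over the positions it actually covers.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    signal = np.asarray(signal, dtype=float)
    n_full = len(signal) // bin_size
    # Reshape the full bins into rows and average each row.
    binned = signal[: n_full * bin_size].reshape(n_full, bin_size).mean(axis=1)
    remainder = signal[n_full * bin_size:]
    if remainder.size:
        binned = np.append(binned, remainder.mean())
    return binned

# Toy per-base signal of length 10 (hypothetical values).
toy = np.array([0, 0, 2, 2, 4, 4, 6, 6, 8, 8])
coarse = bin_signal(toy, bin_size=4)  # two full bins plus one partial bin
fine = bin_signal(toy, bin_size=2)    # five full bins
```

A larger bin size gives a shorter, smoother track at the cost of positional detail, which is the trade-off this factor explores.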

Omitting certain assays

Do certain assays (histone marks, transcription factors, etc.) significantly hurt SAGA accuracy?
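One way to set up such a comparison is a leave-one-assay-out sweep. The sketch below assumes the input is a matrix with one row per genomic bin and one column per assay; the helper `leave_one_out_inputs` and the assay list are hypothetical, and the actual experiments may select assays differently.

```python
import numpy as np

def leave_one_out_inputs(data, assay_names):
    """Yield (omitted_assay, reduced_matrix) pairs, dropping one column at a time."""
    data = np.asarray(data)
    if data.shape[1] != len(assay_names):
        raise ValueError("one assay name is needed per column")
    for i, name in enumerate(assay_names):
        yield name, np.delete(data, i, axis=1)

# Illustrative assay names and a toy 4-bin x 3-assay matrix.
assays = ["H3K4me3", "H3K27ac", "CTCF"]
matrix = np.arange(12).reshape(4, 3)
ablations = dict(leave_one_out_inputs(matrix, assays))
```

Each reduced matrix would then be passed to the same model and scored with the same metric, so that any change can be attributed to the omitted assay.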

Preprocessing: dimensionality reduction

Reduce the dimensionality of the data first, then feed the reduced data into the model.
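As an illustration of this preprocessing step, here is a minimal sketch that uses principal component analysis (PCA) computed with NumPy's SVD. PCA is an assumption chosen for the example, not necessarily the method used in these experiments, and `pca_reduce` and the random toy data are hypothetical.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project rows of `data` (bins x assays) onto the top principal components."""
    data = np.asarray(data, dtype=float)
    if not 1 <= n_components <= min(data.shape):
        raise ValueError("n_components must be between 1 and min(data.shape)")
    # Center each assay (column) before finding directions of maximal variance.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy input: 100 bins x 6 assays of random values (not real data).
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 6))
reduced = pca_reduce(toy, n_components=2)  # shape (100, 2)
```

The reduced matrix keeps one row per bin, so it can stand in for the original assay matrix as model input.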