Predicting Copy Number Variation in Microbial Genome Resequencing Data



Dani Lawson & Pranesh Rao

Introduction

Copy number variation (CNV) is a type of structural variation in which sections of the genome are duplicated or deleted, so that the number of copies of a gene varies from one individual to the next (NIH, 2024). In haploid organisms, CNV can arise through tandem segmental duplication events as well as through transposable elements (Zhang, 2013). Copy number variants can cause metabolic changes, altered disease susceptibility, genetic disorders and diseases, and morphological and physiological changes, and they contribute to population diversity (Pos, 2021). The ability to predict CNVs in haploid genomes can help researchers better understand the effects of CNVs and study their implications in biomedicine.

breseq is a computational pipeline for analyzing haploid genome re-sequencing data developed by the Barrick lab (Barrick, 2014). It is most commonly used to analyze short-read and long-read sequencing data to find mutations (point mutations and structural variations) relative to a reference genome. One current shortcoming of breseq is that it cannot automatically predict and report regions of bacterial chromosomes that are duplicated or amplified to more than two copies, and it cannot always resolve exactly how the chromosome has changed. Furthermore, CNV detection based on read counts can be complicated by biases in read counts across the genome, which arise from the sequencing method (for example, library preparation) and from position relative to the origin of DNA replication.

Figure 1: Overview of the computational pipeline used by breseq to identify and annotate mutations (Barrick, 2014).

GC Bias

Previous research on GC bias in sequencing data has shown that both GC-rich and GC-poor regions of the genome can have low read coverage, which leads to assembly fragmentation (Chen, 2013). Furthermore, read coverage depends on GC content: genomic regions of higher GC content tend to have more variable coverage, which may be either higher or lower than expected (Chen, 2013).

Figure 2: Scatter plots of GC content vs. read counts for two separate cases, representing negative and positive GC bias (Chen, 2013).
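One simple way to remove a GC trend of the kind described above is to group genomic windows by GC content and rescale each window so that every GC group has the same median read count. The sketch below illustrates that idea only. The function names, the fixed number of GC bins, and the choice of median scaling are assumptions made for illustration, and it is not a description of how breseq or CNOGpro perform GC normalization.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Return the fraction of G or C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_correct(window_seqs, window_counts, n_bins=20):
    """Rescale per-window read counts so each GC bin shares the overall median.

    window_seqs   : reference sequence of each window (hypothetical input)
    window_counts : read count observed in each window
    n_bins        : number of equal-width GC bins (an arbitrary choice)
    """
    gc = [gc_fraction(s) for s in window_seqs]
    # Assign each window to a GC bin; GC = 1.0 falls into the last bin.
    bins = [min(int(g * n_bins), n_bins - 1) for g in gc]

    overall = median(window_counts)
    counts_by_bin = defaultdict(list)
    for b, c in zip(bins, window_counts):
        counts_by_bin[b].append(c)
    bin_median = {b: median(v) for b, v in counts_by_bin.items()}

    corrected = []
    for b, c in zip(bins, window_counts):
        m = bin_median[b]
        # Leave the count unchanged if the bin median is zero.
        corrected.append(c * overall / m if m > 0 else float(c))
    return corrected
```

With very few windows per bin the bin medians become noisy, so in practice the bin width would need to be chosen with the genome size and window length in mind.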

Origin of Replication Bias

Circular DNA replication introduces a further bias. In growing cells, replication starts at the origin and proceeds toward the terminus, so sequences near the origin are overrepresented in the reads and sequences closer to the terminus are underrepresented.

Figure 3: (A) A general scheme of prokaryotic circular DNA replication. (B) The resulting bias in reads across genomic coordinates (Syeda, 2020).
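The origin-to-terminus gradient can be sketched as a trend to be fitted and divided out. The example below assumes, as a simplification, that the logarithm of the read count declines linearly with circular distance from the origin, fits that line by ordinary least squares, and removes the fitted trend while keeping the average coverage level. The function names and the log-linear model are assumptions of this sketch, not the method used by breseq or CNOGpro.

```python
import math


def circular_distance(pos, origin, genome_length):
    """Shortest distance between two coordinates on a circular chromosome."""
    d = abs(pos - origin) % genome_length
    return min(d, genome_length - d)


def correct_origin_bias(positions, counts, origin, genome_length):
    """Remove a log-linear coverage trend with distance from the origin.

    positions : window midpoints in genome coordinates
    counts    : read count per window
    origin    : coordinate of the replication origin (assumed known)
    """
    dist = [circular_distance(p, origin, genome_length) for p in positions]

    # Fit log(count) = a + slope * distance on windows with reads only,
    # because log(0) is undefined.
    fit = [(d, math.log(c)) for d, c in zip(dist, counts) if c > 0]
    if len(fit) < 2:
        return [float(c) for c in counts]
    mean_d = sum(d for d, _ in fit) / len(fit)
    mean_y = sum(y for _, y in fit) / len(fit)
    sxx = sum((d - mean_d) ** 2 for d, _ in fit)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in fit)
    slope = sxy / sxx if sxx > 0 else 0.0

    # Divide out the distance-dependent part; the mean level is preserved.
    return [c * math.exp(-slope * (d - mean_d)) for c, d in zip(counts, dist)]
```

Large true amplifications or deletions would pull the fitted line toward themselves, so a more careful version might fit the trend robustly or exclude suspected CNV regions first.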

Figure 4: Workflow of CNOGpro (Brynildsrud, 2015).

Once read counts have been corrected for these biases, a hidden Markov model (HMM) can be used to predict copy number variation, such as amplifications or deletions of genomic regions, and can be run in tandem with breseq on haploid re-sequencing data. This can be accomplished with CNOGpro (Copy Numbers of Genes in prokaryotes), which quickly estimates the number of copies of genes or genomic segments from re-sequencing data (Brynildsrud, 2015). CNOGpro uses a hidden Markov model and constructs confidence intervals for its copy number estimates by bootstrapping (Brynildsrud, 2015).
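To make the HMM idea concrete, the sketch below treats the integer copy number of each genomic window as the hidden state and decodes the most likely sequence of states with the Viterbi algorithm. The Poisson emission model, the fixed probability of staying in the same state, the uniform start distribution, and all function and parameter names are assumptions chosen for illustration. This is not CNOGpro's actual model, and it does not include bootstrapped confidence intervals.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing k events under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(counts, single_copy_mean, max_copies=4, stay_prob=0.99):
    """Return the most likely copy number (0..max_copies) for each window.

    counts           : bias-corrected read count per window
    single_copy_mean : expected read count of a single-copy window
    stay_prob        : probability of keeping the same copy number
                       between adjacent windows (an assumed value)
    """
    if not counts:
        return []
    n_states = max_copies + 1
    # Copy number 0 gets a small nonzero rate so stray reads are tolerated.
    rates = [max(s * single_copy_mean, 0.01 * single_copy_mean)
             for s in range(n_states)]
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))

    def trans(i, j):
        return log_stay if i == j else log_move

    # Uniform start distribution over copy numbers.
    score = [-math.log(n_states) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j)
                             + poisson_logpmf(k, rates[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

A call such as viterbi_copy_number(corrected_counts, single_copy_mean=100) would return one copy number per window. Real read counts are usually more dispersed than a Poisson model allows, so a production version would likely need a different emission distribution.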

Objective

To create a robust pipeline that can be used in conjunction with breseq to predict CNV in microbial haploid genome re-sequencing data.

To achieve this objective we must: