Because common genome sequencing protocols require some form of PCR amplification of the genomic segments, amplification and sequencing efficiency can be inversely related to GC content. This can bias the read counts of two genomic segments that are being compared for their true starting copy number. Determining the copy number of a genomic segment from its read counts therefore needs to account for this bias. We used a nonparametric lowess regression to model and correct for the relationship between GC content and sequencing depth across the genome.
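As an illustration of this kind of correction, the following is a minimal sketch and not the implementation used here. It assumes per-window read counts and GC fractions are already available, fits a simplified lowess (local linear regression with tricube weights, without the robustifying iterations of the full method), and rescales each window by the ratio of the median count to the fitted value. The function names, the `frac` span, and the median rescaling are all assumptions made for the example.

```python
import statistics


def lowess_fit(x, y, frac=0.3):
    """Simplified lowess: tricube-weighted local linear fit at each x.

    Hypothetical helper; omits the robustness iterations of full lowess.
    """
    n = len(x)
    k = min(n, max(3, int(frac * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in x:
        # Bandwidth = distance to the k-th nearest neighbour of x0.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            dx = xi - x0  # centre on x0 so the intercept is the fitted value
            u = abs(dx) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3  # tricube weight
            sw += w
            swx += w * dx
            swy += w * yi
            swxx += w * dx * dx
            swxy += w * dx * yi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw)
    return fitted


def gc_correct(counts, gc, frac=0.3):
    """Rescale counts so the fitted GC trend is flattened to the median."""
    trend = lowess_fit(gc, counts, frac)
    ref = statistics.median(counts)
    return [c * ref / t if t > 0 else c for c, t in zip(counts, trend)]
```

Dividing by the fitted trend, rather than subtracting it, keeps the corrected values on a count-like scale, which is convenient if they are later treated as approximately Poisson distributed.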
The circular DNA of most prokaryotes undergoes bidirectional replication that begins at the origin of replication (ori) and ends at the terminus (ter). This feature of prokaryotic replication can lead to pronounced biases in DNA content at genomic locations depending on their proximity to the ori or ter. During exponential growth, microorganisms can initiate multiple replication "bubbles" at the ori before the entire genome has been replicated through to the ter. Depending on the growth phase of the microorganism being sequenced, it can therefore be important to correct the coverage bias associated with the direction of genome replication. We used the ori/ter genomic coordinates of the E. coli ancestor strain of the sequenced LTEE clones to model this bias with lowess regression and correct the trend.
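One way to set up such a model is to convert each window position into its relative progress along the replichore, from 0 at the ori to 1 at the ter, and then regress coverage on that covariate. The sketch below shows only this covariate under that assumption. The function name and the toy coordinates are hypothetical and are not the coordinates of the ancestor strain.

```python
def replication_fraction(pos, ori, ter, genome_len):
    """Relative position of `pos` between ori (0.0) and ter (1.0).

    The circular genome is split into two replichores: one running
    clockwise from ori to ter, the other counter-clockwise.
    """
    d_cw = (pos - ori) % genome_len      # clockwise distance ori -> pos
    ter_cw = (ter - ori) % genome_len    # clockwise distance ori -> ter
    if ter_cw == 0:
        raise ValueError("ori and ter must be at different positions")
    if d_cw <= ter_cw:
        return d_cw / ter_cw             # clockwise replichore
    # Counter-clockwise replichore: measure the distance going the other way.
    return (genome_len - d_cw) / (genome_len - ter_cw)


# Toy example (hypothetical coordinates): 1000 bp genome, 50 bp windows.
window = 50
starts = range(0, 1000, window)
fractions = [replication_fraction(s + window // 2, 100, 600, 1000)
             for s in starts]
```

The resulting `fractions` could then serve as the x variable of a lowess fit against GC-corrected coverage, with each window divided by its fitted value to remove the ori-to-ter gradient.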
The series of corrected read coverage values across windows along the genome was used as the input sequence for a bespoke HMM-based algorithm that determines changes in copy number state across the genome. We used the mean and variance of the corrected read counts to determine the emission probabilities, using the gamma parameter of a Poisson-distributed coverage profile. Our current algorithm sets an exponentially low transition probability, since frequent changes in genomic copy number are expected to be unlikely. Using the Viterbi algorithm, we then construct an HMM table that records the changes in state across windows in the genome. We output the copy number prediction as an additional column in the input data file, together with a plot of the read counts against the predicted copy numbers.
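The general shape of such a decoder can be sketched as follows. This is a simplified illustration, not the bespoke algorithm itself: it assumes plain Poisson emissions whose mean is the copy number times a single-copy baseline (ignoring the variance), integer copy number states from 1 to `max_cn`, and one fixed log-probability for any change of state. All names and default values are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log Poisson probability; lgamma allows non-integer corrected counts."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(counts, base_mean, max_cn=4, log_switch=-50.0):
    """Most likely copy number per window under a simple Poisson HMM.

    base_mean  : expected corrected count for a single-copy window (assumed)
    log_switch : log-probability of changing to any other state (assumed)
    """
    states = list(range(1, max_cn + 1))
    n = len(states)
    # Probability of staying is whatever is left after all possible switches.
    log_stay = math.log1p(-(n - 1) * math.exp(log_switch)) if n > 1 else 0.0

    def trans(i, j):
        return log_stay if i == j else log_switch

    # Initialise with a uniform prior over states.
    score = [poisson_logpmf(counts[0], s * base_mean) - math.log(n)
             for s in states]
    backptr = []  # the "HMM table": best previous state for each window
    for c in counts[1:]:
        new_score, ptr = [], []
        for j, s in enumerate(states):
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            ptr.append(best_i)
            new_score.append(score[best_i] + trans(best_i, j)
                             + poisson_logpmf(c, s * base_mean))
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final state.
    j = max(range(n), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(backptr):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [states[j] for j in path]
```

With a strongly negative `log_switch`, an isolated high-coverage window does not trigger a state change, whereas a sustained run of doubled coverage does, which reflects the rationale for keeping the transition probability very low.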