Computational Prediction of E. coli Promoters Using DNA Stacking
Andy Tran, Bharadwaj Pingali
Introduction
The purpose of this project is to create a model that identifies promoter regions using the thermodynamic properties of DNA. Knowing the locations of promoter regions gives us a better picture of what genes are actually expressed in an organism. Because of the complexity of eukaryotic transcription activation, this isn't precisely true. So, we'll be specifically targeting promoter regions in prokaryotes because in prokaryotes, promoter region proximity to a transcription start site directly correlates with the expression of that gene.
Promoter regions precede transcription start sites in the genome. These promoters are where an RNA polymerase attaches and begins to transcribe DNA to RNA. However, RNA polymerase is unable to transcribe DNA on its own (in most cases). Sigma factors mediate the interaction between RNA polymerase and the genome. Sigma factors bind to the DNA and complex with the RNA polymerase to begin transcription.
Image from RCSB Protein Data Bank of sigma factor 70
All bacteria have a primary sigma factor that is present during periods of growth within nutrient-rich environments. In E. coli, the primary sigma factor is sigma 70. Depending on the environment, sigma factors like sigma 32 (heat-shock) and sigma 38 (starvation) will become the dominant subpopulation of sigma factors.
Image from EcoCyc of the sigma 70 promoter accA.
Sigma 70 binding is very strongly predicted by the presence of two hexamer consensus sequences at -35bp and -10bp from the transcription start site. Other sequence qualities have also been identified as recognition elements for sigma 70. Because of all these recognition elements, it can be difficult to predict promoter regions based solely on consensus sequences. We propose a strategy for determining sigma 70 promoter regions based on thermodynamic properties.
Methods
Obtain Known Sigma 70 Promoter Sequences
Sequences were obtained from RegulonDB. These sequences were then filtered to produce a list of 880 sequences that were not inferred computationally. This list of 880 sequences will be referred to as the known promoter regions in the sections that follow.
Obtain E. coli Genome
The K-12 strain of E. coli was obtained from NCBI in FASTA format.
MEASUREMENT
Three thermodynamic metrics were used to identify promoter regions: G/C content, melt temperature and delta G. These metrics were chosen based on the motivation that a factor outside of sequence specificity must be driving the binding of sigma factors. Previous work that has shown that the local thermodynamic environment is different between strong and weak sigma factors supports this.
G/C Content
G/C content is a standard measure of thermodynamic stability. Of the canonical Watson-Crick base-pairings, G/C hybridization is stronger because it utilizes three hydrogen bonds. The G/C content of a DNA sequence can be used to estimate the stability of the hybridized structure.
The G/C content was calculated by counting the guanine and cytosine residues within a reading frame.
Melt Temperature
The melt temperature is the temperature at which 50% of DNA molecules are hybridized and 50% are single stranded. Higher melt temperatures indicate stronger hybridized DNA structures.
The melt temperature was calculated using a script from the BioPython Package under the SeqUtils module.
Delta G (Gibb's Free Energy)
General measure of thermodynamic stability.
The delta G was calculated using a modified version of the melt temperature script.
Both melt temperature and delta G were calculated differently in regard to thermodynamics than the G/C content. The thermodynamics were calculated with the nearest-neighbor method. This method incorporates additional effects such as base pair stacking that occurs between adjacent nucleic acids.
Two main search strategies were undertaken. The first is a 60bp fixed reading frame that moves down the genome to identify promoter regions. The second is a reading frame that fixes two hexamer sequences and slides that down the genome to select promoter regions.
Results and Discussion
Promoter Region Sliding Frame
The first step is to learn what a promoter region looks like. To do this, the G/C content, melt temperature, and delta G were calculated for each of the 880 promoter regions. Each of these measures appear to be approximately normally distributed. These values were then graphed:
Now that we know what a promoter region looks like, we need a way to quantify whether or not these values are significant or not. To do this we run the same calculations on all possible 60bp reading frames in the E. coli genome (>9 million reads) and the compare these values with the known promoter values.
The general trend appears to be that the known promoter regions are less stable than the regions calculated from the genome reading frames. The known promoter regions are lower in G/C content and in melt temperature which would indicate lower stability, but there is an even larger gap between the known and the genome data in the delta G that seems to suggest the opposite.
A t-test for difference of means was done for each pair of data sets. At an alpha level of 0.05, each t-test concluded that the null hypothesis (that the true difference in means is 0) was to be rejected. Therefore we have confidence that we can consider these differences significant.
Now that we know that we can use these metrics to differentiate between a promoter region and a random sequence in the genome, we can start to build a model to select sequences in the genome that we think are promoter regions. Our approach to this was to use the bounds that would cover 95% of the observations within the known promoter regions as selection criteria.
A final run through the E. coli genome was done with a fixed reading frame. At every iteration, the reading frame was selected if it met the selection criteria. Lastly, a script using a modified Knuth-Morris-Pratt algorithm (from Neil Fraser) was used to scan for overlapping sequences to piece together the overlapping sequences into consensus promoter regions.
This proposed model predicted 28585 promoter regions, however there are 1907 known promoter regions. This indicates that this model produces plenty of false positives. Further analysis of the data showed that of the 1907 promoters known, the model accurately predicted 1748 promoters, with 159 promoters missing. Of the 880 promoters that were experimentally determined, the model failed to account for 44 of these promoters. Of the 1027 computationally inferred promoters, the model failed to account for 115 promoters.
Hexamer Region Sliding Frame
The same process was done with the hexamer sliding frames.
The data seemed to indicate that hexamers also had a significant different between the average values for the G/C content, delta G, and melt temperature. This trend follows that of the promoter region analysis. The promoters were found using a similar methods to find promoters in the 60 base pair promoter regions. The results indicated that 6961 promoters which was less than the number of predicted promoters of the 60 base pair promoter region analysis. There seems to be some indication that using the hexamer approach to discovering promoters is more effective than simply looking at the entire region. The model predicted 875 of the known promoters, and there are 880 total known promoters. The model also predicted 477 out of the 1748 promoters. The rest of the numbers are the noise picked up by the model. The total number of promoters missed was 555.
Final Thoughts
Between the two models it appears that the hexamer model picked up the known promoters more easily. Of the 9 million entries, the hexamer data predicted 6961 promoters whereas the promoter region analysis predicted 28585 . In addition to the the hexamer analysis predicted 75 more of the experimentally verified promoters than the promoter region analysis. This prediction also missed 555 promoters whereas the promoter region analysis missed only 159. This means that while the model predicted less overall promoters, it also missed those that were computationally predicted before. It is entirely possible the computationally determined promoters are incorrect, however this is unlikely. Finally it is also important to note that these numbers refer only to the sigma 70 factor, and does not include the promoters from the other sigma factors within E. coli.
The model, through the various conditions imposed, initially determined that there were 6 million promoters of the 9 million entries as predicted by using a sliding reading frame to find the most energetically favorable 60 base pair regions. These entries contained repetitions because each time the reading frame read in an energy, the nucleotides shifted by one and therefore the energy difference in the promoters was negligible and all the iterations of a similar region were accounted for. An algorithm to eliminate the repetitive entries through clustering was written and was used to determine the 8886 promoters of the promoter region analysis. When the proposed promoters are again analyzed, it becomes apparent that there still exist repetitions that were not caught by the initial algorithm, in addition to this if the reading frame of the promoter does not match exactly with the known promoter, the program claims that it does not exist. To have a more comprehensive understanding of the model that was proposed, a better clustering algorithm would be advised.
References
(1) Saecker, R. M.; Record, M. T.; Dehaseth, P. L. J. Mol. Biol. 2011, 412, 754–71.
(2) Weindl, J.; Hanus, P.; Dawy, Z.; Zech, J.; Hagenauer, J.; Mueller, J. C. Nucleic Acids Res. 2007, 35, 7003–10.
(3) Yakovchuk, P.; Protozanova, E.; Frank-Kamenetskii, M. D. Nucleic Acids Res. 2006, 34, 564–74.
(4) Gruber, T. M.; Gross, C. A. Annu. Rev. Microbiol. 2003, 57, 441–466.