Further analysis of G-quadruplexes (G4s) consisted of identifying G4s in an array of organisms with varying GC content [5]. Organisms were selected on the basis of the optimal temperature of each organism [6-13] because higher GC content is associated with genome thermostability [4]. The G4 motif finder was applied to genomes extracted from the National Center for Biotechnology Information (NCBI) [5], and the number of G4s per million base pairs was calculated. Organisms were then sorted by descending GC content [5] to visualize differences in G4 frequency among organisms with varying GC content (Figure 2). Higher GC content is also associated with genome length [4], so information regarding genome length was collected [5] and further analyzed (Figure 3).
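The per-genome statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the motif finder used in this work: the regular expression assumes the commonly used canonical motif (four runs of at least three guanines separated by loops of 1 to 7 bases), and the names `G4_PATTERN`, `gc_content`, `count_g4s`, and `g4s_per_million_bp` are hypothetical. For simplicity it scans only the given strand and counts non-overlapping matches.

```python
import re

# ASSUMPTION: canonical G4 motif, G{3,} N{1,7} G{3,} N{1,7} G{3,} N{1,7} G{3,}.
# The motif definition actually used in this work may differ.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def gc_content(seq: str) -> float:
    """Return the GC content of a sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def count_g4s(seq: str) -> int:
    """Count non-overlapping matches of the assumed G4 motif on one strand."""
    return sum(1 for _ in G4_PATTERN.finditer(seq.upper()))


def g4s_per_million_bp(seq: str) -> float:
    """Normalize the G4 count by sequence length (G4s per million bp)."""
    if not seq:
        return 0.0
    return count_g4s(seq) * 1_000_000 / len(seq)


if __name__ == "__main__":
    # Toy sequence (human telomeric-style repeat), not data from this study.
    toy = "GGGTTAGGGTTAGGGTTAGGG"
    print(count_g4s(toy), gc_content(toy), g4s_per_million_bp(toy))
```

Normalizing by length is what makes genomes of very different sizes comparable; a raw count would mostly reflect genome size.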
Figure 1: This tool can be used to analyze G4s in various genomes [5-13]
Figure 2: Visualization of frequency of G4s of organisms organized by GC content in ascending order.
Figure 3: Effect of genome length on the primary analysis. The figure displays the G4 frequency of organisms organized by genome length in descending order.
The G4 motif finder was then applied to the human and yeast genomes, with particular attention to the number of G4s per chromosome. All genomes were extracted from NCBI.
Figure 4: Number of G4s total per yeast chromosome, including the mitochondrial genome.
Figure 5: Number of G4s total per human chromosome, including the mitochondrial genome.
Figure 6: Comparison of G4 frequency and GC content in each chromosome of the human genome.
Figure 7: Percentages for total G-Quadruplexes found within genes and outside of genes for complete yeast and human genomes.
Figure 8: Top 10 enriched genes in the yeast genome. As the entire genome only has 9 G-Quadruplexes, all 9 are shown in this list. Each gene contained only one G-Quadruplex.
Figure 9: Top 10 enriched genes in the human genome.
Figure 10: GO enrichment analysis was performed using G:Profiler [17] on the highlighted genes most associated with G4s. This analysis indicated significant enrichment of the GO terms for interleukin receptor SHC signaling and surfactant metabolism.
G4 variation across an array of genomes
Of the array of genomes tested with the constructed G4 motif finder, genomes with less than 31.5% GC content had no G4s identified (Figure 2). This may be because G4s are more likely to form when more guanines are present in a genome. Of note, organisms from the Thermus genus had more G4s than other organisms in the array. This may be because these organisms are associated with warmer temperatures [11,13], which typically correlates with higher GC content [4]. However, higher GC content does not fully explain this finding: S. erythraea had the highest GC content but did not have the highest G4 frequency (Figure 3). These findings therefore require further investigation. Future steps could include using a wider array of organisms and a less stringent G4 motif finder.
G4 variation by chromosome
To examine how G4 counts vary in organisms that have multiple chromosomes, we applied our G4 motif finder to both the yeast and human genomes (Figures 4 and 5).
In yeast, it is not uncommon for entire chromosomes to have one or no G4s. Interestingly, two yeast chromosomes had three G4s each. Because G4s sometimes have regulatory functions, it makes sense that they are not randomly distributed throughout the genome.
In humans, we might expect the number of G4s to be determined by chromosome length. To some degree, we did see this in the results: chromosome 1 is the longest human chromosome and also had the most G4s. However, G4 count by chromosome cannot be explained entirely by chromosome size. For example, chromosomes 19, 20, and 21 do not differ widely in size, but chromosome 20 has far more G4s than the other two.
Together, these results in both yeast and humans suggest that G4 content per chromosome is not random and instead serves some potential purpose dependent on the gene content of the chromosomes.
Is G4 formation related to GC content?
To determine whether G4s are more frequent in chromosomes with higher GC content, we compared these two metrics for each of the human chromosomes (Figure 6). The data suggest a strong positive correlation between G4 frequency and GC content. Chromosome 19 has the highest GC content at 48% and also the highest frequency of G4s at 117 per million bp, while chromosome 13 has the lowest GC content at 38.5% and the lowest frequency of G4s at 35 per million bp.
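One way to put a number on a relationship of this kind is a Pearson correlation coefficient between per-chromosome GC content and G4 frequency. The sketch below is only an illustration: the function name `pearson_r` is hypothetical, the text does not state which statistic (if any) was computed, and the input values are made-up toy numbers rather than the data behind Figure 6.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up toy values, NOT data from this study.
    toy_gc = [1.0, 2.0, 3.0, 4.0]
    toy_g4 = [2.0, 4.0, 6.0, 9.0]
    print(round(pearson_r(toy_gc, toy_g4), 3))
```

A coefficient near 1 would support the visual impression of a strong positive relationship; with only about two dozen chromosomes, a rank-based statistic could be a reasonable alternative if outliers are a concern.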
Are G4s mostly within genes or outside of genes?
Knowing that G4s can be associated with the promoter regions of genes, we were curious to see whether the G4s we found were within genes (Figure 7).
For yeast, which has a low number of G4s to begin with, the numbers of G4s within genes and outside genes were about equal. This even distribution suggests there is no particular pattern to G4 placement in the yeast genome; that is, G4s have an equal chance of being within a gene or outside a gene.
Oddly, we found that in humans, where far more G4s were present overall, a higher proportion of G4s (64%) lies within genes. This may indicate some selection for G4 formation within genes in humans.
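The within-gene versus outside-gene split can be computed by testing each G4 position against annotated gene intervals. The following is a minimal sketch under stated assumptions, not the procedure used here: each G4 is represented by a single start coordinate, genes are half-open intervals `[start, end)` on one chromosome, and the names `merge_intervals` and `fraction_in_genes` are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fraction_in_genes(g4_positions, gene_intervals):
    """Return the fraction of G4 positions that fall inside any gene.

    ASSUMPTION: positions and intervals share one coordinate system on a
    single chromosome; intervals are half-open, so `end` itself is outside.
    """
    if not g4_positions:
        return 0.0
    merged = merge_intervals(gene_intervals)
    starts = [start for start, _ in merged]
    inside = 0
    for pos in g4_positions:
        # Index of the last merged interval starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            inside += 1
    return inside / len(g4_positions)


if __name__ == "__main__":
    # Toy coordinates, not annotations from this study.
    genes = [(10, 20), (15, 30), (50, 60)]
    g4s = [5, 12, 25, 30, 55]
    print(fraction_in_genes(g4s, genes))
```

Merging overlapping genes first prevents a G4 from being counted twice. To judge whether a value such as 64% reflects selection, it would help to compare it against the fraction of the genome that genes cover.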
What is the significance of genes that contain G4s?
We were curious to know if certain genes were more enriched for G4s than others, or if there was an intuitive explanation for why some genes had G4s. To answer this question, we created data frames of the top gene hits for the G4s we found (Figures 8 and 9).
In yeast, the total number of G4s was low, so each gene we identified contained only one G4. We next asked whether there is any pattern among the genes that do have G4s. There was ultimately no obvious trend among the 9 genes we identified in yeast, but some of the genes were connected to regulation. Bas1p, for example, is a transcription factor, and Tbf1p helps transcription factors bind to DNA. We can speculate that there is some advantage for transcription factors to be associated with G4s; perhaps their particular G4 motif plays an important role in how transcription factors fold and function.
GO term enrichment
We also wanted to determine whether GO enrichment analysis, using G:Profiler [17], of genes that were highly associated with G4s in the human genome could elucidate any trends regarding G4 formation in humans. This was done by compiling the genes that were highly associated with G4s by having G4s within the gene, downstream of the gene, or nearest the gene. This analysis indicated significant enrichment of the GO terms for interleukin receptor SHC signaling and surfactant metabolism. Because G4s are typically found in promoter regions and are believed to have regulatory effects [3], enrichment regarding metabolism is expected. Interleukin receptor SHC signaling is involved in regulating Ras and Ras-like pathways [18]. These pathways are often associated with cell growth and differentiation [19], further indicating that G4s are highly associated with regulatory processes in humans. Further work could analyze the abundance of G4s in genes associated with Ras to gain a deeper understanding of the importance of G4s in gene regulation.