Due to the nature of inconsistent in sequencing quality of the genomes, the genotyping analysis has a lot of background noise. We made our best effort to remove erronous strains that could serverly impact our alignments.
The complete genomes of SARS-COV-2 is used to reconstruct the phylogeny. We retrieved the reference clade gene from Nextstrain and inferred the maximum-likelihood tree of complete genomes using the default option model in IQ-TREE (Nguyen et al., 2015). As it can be inferred from the clade phylogeny tree of Nextstrain (Figure 2b), Figure 2a shows two dominant clades; A2a and B1. Interestingly, most strains clustered in B1 clade doesn’t include the strains from NY, whereas the strains grouped in A2a clade does indicate the strains from NY.
LEISR results can be interpreted as the the relative evolutionary rate at each site with respect to the input nucleotide and amino acid sequences of SARS-CoV-2. The output shows mutation diversity, meaning that the sites with higher rate values indicate more variations compared to the rest of the sequence. From our LEISR result (Figure 3) and the generated MSA (Video1) from the complete genomes dataset, high peaks (~10.0) from the plot are likely due to sequencing errors, while medium peaks (~5.0) show the true/actual nucleotide mutations.
We leverage LEISR result and coupled with MSA information to genotype the whole genome and amino acid sequences in our dataset. Table 1 shows the summarized list of those frequent mutation according to our results from 2154 WGS MSA and LEISR analysis. The high-frequency nucleotide mutations predominant in the virus isolations locate either in the RNA replicase or structural protein including S protein, E protein, and N protein. Therefore, we speculate that the observed high-frequency mutations may contribute to increased transmissibility.
With the observation that the sequence identity of the alignment is over 99%, we think that dN/dS analysis is not an appropriate/beneficial test for a number of reasons. First, our dataset is a collection of SARS-CoV-2 infections coming from a very small time window, with a limited geographical location (USA). We did not include any closely related coronavirus strains, so we cannot calculate divergence time.
As the S gene is the major antigen targeting for vaccine development, we decided to focus on the sequence analysis of S gene on both nucleotide and amino acid levels. Again, based on our LEISR result (Figure 4) and the MSA (Video2) , high peaks (~1.00) are also likely due to sequencing errors, while medium peak(~0.03) shows the actual nucleotide mutation sites. Figure 5 shows the same data but with the amino acid sequence in the x axis from the Figure4, and Table 2 shows the summarized list of those frequent mutation according to our results from 2154 S genes MSA and LEISR analysis. The three most frequently mutated regions in S gene are NTD, RBD, and the gene near the S1/S2 furin protease cleavage sites. Surprisingly, the nucleotide mutation '1841A>G' and its corresponding amino acid substitution 'D614G' in S gene is identified as the genomic site that has the most abundant mutation. We will call the D614G as ‘hotspot’. When we analyze the S gene MSA dataset in Geneious, the most of the strains located in NY have D614G, which is likely due to community transmission. In contrast, the half of the strains located in WA have D614G and the other half shows D614 amino acid without mutation (D614). Also, all the strains from cruisehip travelers have D614. Thus, this result suggests each geographical region shows its unique and distinctive SNP mutation at D614 hotspot position, indicating the community transmission within the same region.
Then, we went back to the phylogeny tree in Figure 2 to see if which mutation is predominant in clade A2a and B1. Interestingly, the reference genome of A2a clade has D614G, whereas the reference genome of B1 clade shows the D614 without mutation. These findings are consistent with the results that B1 clade with D614 does not include the strain from NY, whereas the A2a clade with D614G shows strains from NY. As we identified that the D614G in S gene is the hotspot of SNP, we, therefore, especially focus on D614G hotspot in further analysis to study how D614G hotspot affects the structure of S protein.
Next, we ask question of how the D614G hotspot can potentially affect the structure of S protein. In this analysis, we choose two strains; one (GISAID accession ID = 420305) that has residue D614, the other (GISAID accession ID = 424964) that has residue D614G (mutation when compared to the Wuhan reference). The structure of S gene of those two strains were predicted using a comparative (homology) modeling provided by Chimera and Modeller. The S gene structure(PDB ID=6VSB) was used as a template for the modeling (Video 3). Each modeled protein with the lowest z-dope score was chosen to be overlaid onto the template S gene structure via Pymol (Wrapp et al., 2020). Video4 and video5 show the overlaid structure of D614G or D614 onto the reference structure respectively. Surprisingly, RBD region of D614G structure does not align well with the RBD of reference structure, whereas the RBD of D614 structure overlays well onto the reference. As D614G is near the S1/S2 cleavage site(RRAR) at 682-685 amino acid, we assume that substitution of the aspartate, which contain negatively charged side chain, to the glycine with small/simple and neutral side chain affects the interaction with the neighboring residues, such as furin protease cleavage site, in a beneficial way for viral evolution and increased transmission of SARS-CoV-2.
Finally, to further investigate the temporal SNP composition of amino acid at position 614 in S gene in USA, we plot the number of strains with D614 or D614G residue according to the timeline from early Jan to late Apr. Surprisingly, D614G becomes more dominant than the D614, which was originally prevalent(Figure 6). These findings suggests that hotspot SNP mutation sites serves a promising indication for monitoring and tracking of SARS-CoV-2 in their temporal and local genetic variations. Since we found no significant evidence that the "hot spot" mutation has an impact on the protein function itself, we came up with several speculations/hypotheses:
First, the shift in D614G in later time could result from an undetected super spreader in the community that carry the genotype.
Another explanation of why we observe the shift of SNP proportion in the population could be due to of the imbalance of numbers of submitted sequences within the US. Since early February, Seattle became the first epicenter of Covid-19 in the US and that led to the massive sequencing efforts from local scientists. Given the facts that strains circulated in WA originated from Wuhan (analyzed by Trevor Bedford) and most strains in the dataset with earlier submission dates come from WA, D614 which is known as the dominant strain would account for a higher proportion in earlier days. In contrast, dominant strains in the NY outbreaks beginning to circulate from mid March are closely related to the SARS-CoV-2 strains from Europe (Gonzalez, A. S. et al., 2020). Perhaps this change of amino acid is can also be found in Europe samples. Further analysis from strains of other countries will provide insights and clues to our speculation.