Multiple Sequence Alignment
A total of 2,153 SARS-CoV-2 whole-genome sequences from patients were downloaded from the GISAID database (Shu and McCauley, 2017) with the following filters:
Location: North America (Chart 1)
Submission date range: Feb - April 20th, 2020
Complete genome and high coverage
The SARS-CoV-2 reference genome (Wuhan-Hu-1) was downloaded from NCBI (accession ID: NC_045512).
Alignments were performed using MAFFT (v7.450) with default parameters.
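The alignment step can be sketched as a small Python wrapper around the MAFFT command line. This is an illustration under stated assumptions, not the exact invocation used here: the file names are hypothetical, and "default parameters" is taken to mean calling mafft with no options, in which case the alignment is written to standard output.

```python
import shutil
import subprocess
from pathlib import Path


def mafft_command(fasta_in):
    """Build a MAFFT call that relies entirely on the program defaults."""
    return ["mafft", str(fasta_in)]


def run_mafft(fasta_in, fasta_out):
    """Run MAFFT and save the alignment it prints to stdout in fasta_out."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft executable not found on PATH")
    with open(fasta_out, "w") as handle:
        subprocess.run(mafft_command(fasta_in), stdout=handle, check=True)
    return Path(fasta_out)


# Hypothetical usage (file names are placeholders, not from this study):
# run_mafft("gisaid_north_america.fasta", "gisaid_north_america.aln.fasta")
```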
Phylogenetic tree estimation
We generated maximum-likelihood trees for the complete genomes and for the S gene alone using the default model option in IQ-TREE (Nguyen et al., 2015). The gene tree was generated for downstream analysis. We added clade reference sequences based on the Nextstrain clade phylogeny.
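A minimal sketch of the tree-building step follows, assuming an executable named iqtree on the PATH and hypothetical alignment file names. No -m flag is passed, so the model is whatever the installed IQ-TREE version uses by default; the tree is expected in a file named after the alignment with a .treefile suffix.

```python
import subprocess


def iqtree_command(alignment, executable="iqtree"):
    """Build an IQ-TREE call without -m, leaving model choice to the default."""
    return [executable, "-s", str(alignment)]


def build_trees(alignments, executable="iqtree"):
    """Run IQ-TREE once per alignment and return the expected tree files."""
    tree_files = []
    for alignment in alignments:
        subprocess.run(iqtree_command(alignment, executable), check=True)
        # IQ-TREE writes the maximum-likelihood tree to <alignment>.treefile
        tree_files.append(f"{alignment}.treefile")
    return tree_files


# Hypothetical usage: one tree for whole genomes, one for the S gene.
# build_trees(["genomes.aln.fasta", "s_gene.aln.fasta"])
```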
Evolutionary rates inference
To infer evolutionary rates from phylogenetic protein and nucleotide data, we used LEISR (Likelihood Estimation of Individual Site Rates, pronounced "laser") (Spielman and Pond, 2018), implemented in HyPhy (Hypothesis Testing Using Phylogenies).
LEISR requires a phylogenetic tree and the multiple sequence alignment.
The algorithm can be broken down into two steps: (1) estimating alignment-wide branch lengths under a specified substitution model (for simplicity, we chose the GTR model) and (2) calculating a relative rate Rs at each site, which uniformly scales all branch lengths of the given partition-specific tree.
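The second step can be written compactly. The notation here is an assumption introduced for illustration: \hat{t}_b denotes the alignment-wide length of branch b estimated in step (1), and t_b^{(s)} the length of that branch as seen by site s.

```latex
% Step (2): with the step-(1) branch lengths held fixed, site s sees
% every branch scaled by the same factor R_s.
t_b^{(s)} = R_s \, \hat{t}_b \qquad \text{for every branch } b \text{ of the tree}
```

Under this reading, R_s > 1 corresponds to a site evolving faster than the alignment-wide average and R_s < 1 to a slower one.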
We processed the JSON output and exported it to R for visualization.
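The conversion from JSON to a table that R can read might look like the sketch below. The JSON layout is an assumption: per-site estimates are taken to sit under MLE, then content, then partition "0", as rows of [rate, lower bound, upper bound]. The layout should be checked against the actual HyPhy output before use, and the file names are hypothetical.

```python
import csv
import json


def site_rates(results):
    """Return one (site, rate, lower, upper) tuple per alignment column.

    Assumed layout: results["MLE"]["content"]["0"] is a list of
    [rate, lower, upper] rows for the first (only) partition.
    """
    rows = results["MLE"]["content"]["0"]
    return [(i + 1, row[0], row[1], row[2]) for i, row in enumerate(rows)]


def write_rates_csv(json_path, csv_path):
    """Read a LEISR JSON file and write a CSV with one row per site."""
    with open(json_path) as handle:
        results = json.load(handle)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["site", "rate", "lower", "upper"])
        writer.writerows(site_rates(results))


# Hypothetical usage; the CSV can then be loaded in R with read.csv().
# write_rates_csv("s_gene.LEISR.json", "s_gene_site_rates.csv")
```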
Comparative (homology) modeling
To study how the D614G hotspot affects the structure of the S protein, we used UCSF Chimera, UCSF Modeller, and PyMOL.
Two strains were chosen: (1) one with D614 (GISAID accession ID = 420305) and (2) one with D614G (GISAID accession ID = 424964) in the S gene.
The modeled protein with the lowest z-DOPE score was chosen and overlaid onto the template S protein structure (PDB ID = 6vsb) in PyMOL.
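The model selection rule amounts to taking the minimum over the candidate scores. The sketch below assumes the z-DOPE score of each candidate has already been collected into a dictionary; the model names and values in the usage example are hypothetical and are not results from this study.

```python
def best_model(scores):
    """Return the (name, score) pair with the lowest z-DOPE score."""
    if not scores:
        raise ValueError("no models to choose from")
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical usage with made-up scores:
# best_model({"model_1.pdb": -0.41, "model_2.pdb": -0.87})
```

Lower (more negative) z-DOPE values are generally read as indicating a more native-like model, which is why the minimum is taken.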