Compositional genome maps of the coronavirus SARS-CoV-2

Nucleotide sequences are formed by patches or domains of different base composition. In simple, homogeneous sequences, domains can be identified by eye; however, most nucleotide sequences show a complex compositional heterogeneity. To analyse such non-stationary sequence structures, we used a computationally efficient segmentation method based on the Jensen–Shannon entropic divergence (Bernaola-Galván et al., 2012; Oliver et al., 1999).

We divided a given nucleotide sequence into compositionally homogeneous, non-overlapping domains using a heuristic, iterative segmentation algorithm (Bernaola-Galván et al., 2008, 1996; Oliver et al., 2004, 1999). In brief, a sliding cursor is moved along the sequence, and the position that maximizes the compositional divergence between the left and right parts is selected. We chose the Jensen–Shannon divergence (equations (1) and (2) in Bernaola-Galván et al., 1996) as the measure because it can be applied directly to symbolic nucleotide sequences. If the divergence is statistically significant (at a given significance level, s), the sequence is split into two segments at that position. Note that the resulting segments are more homogeneous than the original sequence. The two segments are then independently subjected to a new round of segmentation, and the process continues iteratively until no remaining segment can be split with sufficient significance. It is worth mentioning that the segmentation algorithm we used, and hence the complexity values derived from it, are invariant to sequence orientation, as Shannon entropy is invariant under symbol interchange. The final result is the segmentation of the original sequence into a series of contiguous segments that are compositionally homogeneous at the chosen significance level.
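The recursive procedure described above can be illustrated with a minimal Python sketch. It is not the published implementation. The function names (`shannon_entropy`, `best_cut`, `is_significant`, `segment`) and the parameters `min_len` and `n_shuffles` are hypothetical. The significance test is a simple shuffling test swapped in for illustration, not the test used in the cited papers.

```python
import math
import random
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of a table of symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def best_cut(seq):
    """Return (position, divergence) of the cut that maximizes the
    length-weighted Jensen-Shannon divergence between left and right parts.
    Position is None when no cut gives a positive divergence."""
    n = len(seq)
    total = Counter(seq)
    h_total = shannon_entropy(total, n)
    left = Counter()
    best_pos, best_d = None, 0.0
    for i in range(1, n):  # cursor sits between seq[i-1] and seq[i]
        left[seq[i - 1]] += 1
        right = total - left
        d = (h_total
             - (i / n) * shannon_entropy(left, i)
             - ((n - i) / n) * shannon_entropy(right, n - i))
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def is_significant(seq, observed, s, n_shuffles, rng):
    """ASSUMPTION: shuffling test used as a stand-in for the published test.
    The cut is accepted if at least a fraction s of shuffled copies of seq
    have a maximum divergence below the observed one."""
    chars = list(seq)
    below = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if best_cut(chars)[1] < observed:
            below += 1
    return below / n_shuffles >= s


def segment(seq, s=0.95, min_len=10, n_shuffles=100, rng=None, offset=0):
    """Recursively split seq; return a list of (start, end) domain coordinates.
    min_len (hypothetical) stops recursion on sequences shorter than 2 * min_len."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so repeated runs agree
    whole = [(offset, offset + len(seq))]
    if len(seq) < 2 * min_len:
        return whole
    pos, d = best_cut(seq)
    if pos is None or not is_significant(seq, d, s, n_shuffles, rng):
        return whole
    return (segment(seq[:pos], s, min_len, n_shuffles, rng, offset)
            + segment(seq[pos:], s, min_len, n_shuffles, rng, offset + pos))
```

Calling `segment("AU" * 100 + "GC" * 100)` on this toy sequence should recover the boundary between the AU-rich and GC-rich halves. The shuffling test costs one full cursor scan per shuffle, so this sketch is only practical for short sequences.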

The coronavirus RNA sequence can be segmented as is (i.e. as a string of four symbols), or as a binary sequence using some of the alphabets (mapping rules) shown in Table 1 of Bernaola-Galván et al. (1999). In particular, we used the four nucleotides A, U, C, G for the sequence compositional complexity (SCC) measure, AU/CG for SCC_SW, AG/CU for SCC_RY and AC/UG for SCC_KM, which together provide a complete complexity landscape along the nucleotide sequence. The compositional maps for 3209 coronavirus genomes are shown in the UCSC Genome Browser by means of a track hub.
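As an illustration of these mapping rules, the following Python sketch recodes a sequence into each alphabet before segmentation. The names `ALPHABETS` and `recode` are hypothetical. The group labels W/S, R/Y and M/K follow the IUPAC nucleotide codes. Converting T to U is an assumption made here because genome files often store RNA genomes with T.

```python
# Hypothetical mapping table for the four alphabets named in the text.
# Group labels follow IUPAC codes: W/S (weak/strong), R/Y (purine/pyrimidine),
# M/K (amino/keto).
ALPHABETS = {
    "SCC": None,  # keep the four symbols A, U, C, G unchanged
    "SCC_SW": {"A": "W", "U": "W", "C": "S", "G": "S"},
    "SCC_RY": {"A": "R", "G": "R", "C": "Y", "U": "Y"},
    "SCC_KM": {"A": "M", "C": "M", "U": "K", "G": "K"},
}


def recode(seq, alphabet):
    """Return seq rewritten in the chosen alphabet.

    ASSUMPTION: input may be lower case and may use T instead of U.
    Ambiguous bases such as N are not handled and raise KeyError
    for the binary alphabets.
    """
    seq = seq.upper().replace("T", "U")
    mapping = ALPHABETS[alphabet]
    if mapping is None:
        return seq
    return "".join(mapping[base] for base in seq)
```

For example, `recode("AUGC", "SCC_RY")` gives `"RYRY"`. The recoded string can then be passed to any segmentation routine that works on symbol sequences.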

Graphical representation of the compositional segmentation of the coronavirus reference genome (NC_045512.2) in the UCSC Genome Browser. The compositional domains obtained for the four nucleotides A, U, C, G (SCC), AU/CG (SCC_SW), AG/CU (SCC_RY) and AC/UG (SCC_KM) are shown. The GC percent in 5-base windows and the NCBI genes are also shown.

References

Bernaola-Galván P, Oliver JL, Hackenberg M, Coronado AV, Ivanov PC, Carpena P. 2012. Segmentation of time series with long-range fractal correlations. Eur Phys J B 85:211. doi:10.1140/epjb/e2012-20969-5
Bernaola-Galván P, Oliver JL, Román-Roldán R. 1999. Decomposition of DNA sequence complexity. Phys Rev Lett 83:3336–3339.
Oliver JL, Román-Roldán R, Pérez J, Bernaola-Galván P. 1999. SEGMENT: identifying compositional domains in DNA sequences. Bioinformatics 15:974–979.
Román-Roldán R, Bernaola-Galván P, Oliver JL. 1998. Sequence compositional complexity of DNA through an entropic segmentation method. Phys Rev Lett 80:1344–1347.