Indel annotations via SNV annotations

Step 1: Convert an indel to local SNVs. Similar to predicting the effect of an indel in the coding region of a protein (an example is given in Figure 1 of Zia and Moses, 2011), we assume an indel may change the local DNA motif (if one exists) by introducing SNVs. However, we do not know beforehand to what extent an indel may affect its nearby regions. Therefore, mostly for convenience, we only “translate” an indel to SNVs within a local region no larger than the length of the inserted or deleted allele, which we call the focal length of the indel.

The above figure illustrates an example of an insertion. The insertion is defined by chromosome, position, reference allele and alternative allele, as in the VCF format (Danecek, P. et al., 2011). Excluding the anchor nucleotide “C”, the inserted allele is “TGTT”, so the focal length of this insertion is 4. When anchored left, this inserted allele corresponds to “TGAG” on the reference sequence. Within the focal length of 4 we can then define 4 SNVs: T>T, G>G, A>T and G>T. The first two SNVs have identical reference and alternative alleles; we call these pseudo-SNVs. A pseudo-SNV cannot be annotated by an SNV-centric annotation resource (e.g. CADD) but can be annotated by a position-centric annotation resource (e.g. GERP++). The remaining two SNVs are normal SNVs introduced by the insertion; therefore the “focal SNV number” for this insertion is 2. Focal SNVs and pseudo-SNVs can be defined similarly for deletions and block substitutions (sub-figures B and C).
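The translation of an insertion into focal SNVs and pseudo-SNVs can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code: the function name insertion_to_snvs, the anchor position 100 and the downstream argument (the reference bases immediately after the anchor) are all hypothetical, and only simple left-anchored insertions are handled.

```python
def insertion_to_snvs(pos, ref, alt, downstream):
    """Translate a VCF-style insertion into per-position SNVs.

    pos        -- 1-based position of the anchor nucleotide (hypothetical here)
    ref, alt   -- VCF REF and ALT alleles; alt is the anchor plus the inserted allele
    downstream -- reference bases immediately after the anchor (assumed input)

    Returns a list of (position, ref_base, alt_base, is_pseudo) tuples.
    """
    if not (len(ref) == 1 and len(alt) > 1 and alt.startswith(ref)):
        raise ValueError("expected a simple left-anchored insertion")
    inserted = alt[1:]              # drop the anchor nucleotide
    focal_length = len(inserted)    # focal length = length of the inserted allele
    if len(downstream) < focal_length:
        raise ValueError("need at least focal_length reference bases after the anchor")
    snvs = []
    for i in range(focal_length):
        r, a = downstream[i], inserted[i]
        # identical reference and alternative bases make a pseudo-SNV
        snvs.append((pos + 1 + i, r, a, r == a))
    return snvs


# The insertion from the text: anchor "C", inserted allele "TGTT", reference "TGAG".
snvs = insertion_to_snvs(100, "C", "CTGTT", "TGAG")
focal_snv_number = sum(1 for s in snvs if not s[3])
for position, r, a, is_pseudo in snvs:
    print(position, f"{r}>{a}", "pseudo-SNV" if is_pseudo else "focal SNV")
print("focal SNV number:", focal_snv_number)
```

For the example in the text this yields the four SNVs T>T, G>G, A>T and G>T, with a focal SNV number of 2.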

Please note that we define the positions on the reference sequence that correspond to an indel allele by anchoring the allele to the left. We anchor left rather than right because many insertions (and deletions) are produced by inserting (or deleting) units of short tandem repeats. In VCF files, such an event is typically represented as inserting (or deleting) the first such unit(s) in the reference sequence. Therefore, by anchoring left we can easily identify those events, as their focal SNV number will be 0 (and we assume those events have smaller functional effects than events whose focal SNV number is larger than 0).
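The effect of left anchoring on tandem-repeat events can be checked with a small Python sketch. The function focal_snv_number and the toy reference “ATATATGC” are hypothetical. The handling of deletions is an assumption, since the text does not spell it out: after the deletion, the bases downstream of the deleted allele are taken to shift left into the focal window.

```python
def focal_snv_number(ref_after_anchor, allele, kind):
    """Count focal SNVs for a left-anchored indel.

    ref_after_anchor -- reference bases immediately after the anchor nucleotide
    allele           -- inserted or deleted allele, anchor excluded
    kind             -- "insertion" or "deletion"
    """
    n = len(allele)                      # focal length
    old = ref_after_anchor[:n]           # reference bases in the focal window
    if kind == "insertion":
        new = allele                     # the inserted allele occupies the window
    elif kind == "deletion":
        if old != allele:
            raise ValueError("deleted allele does not match the reference")
        # ASSUMPTION: downstream bases shift left into the focal window
        new = ref_after_anchor[n:2 * n]
    else:
        raise ValueError("kind must be 'insertion' or 'deletion'")
    if len(old) < n or len(new) < n:
        raise ValueError("not enough reference sequence for the focal window")
    return sum(1 for o, w in zip(old, new) if o != w)


repeat_ref = "ATATATGC"  # toy reference: three "AT" units after the anchor
print(focal_snv_number(repeat_ref, "AT", "insertion"))  # one more "AT" unit
print(focal_snv_number(repeat_ref, "AT", "deletion"))   # one "AT" unit removed
print(focal_snv_number("TGAG", "TGTT", "insertion"))    # the insertion from the text
```

Under these assumptions, both tandem-repeat events have a focal SNV number of 0, while the non-repeat insertion from the text has a focal SNV number of 2.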

Step 2: Annotating focal SNVs and pseudo-SNVs. The second step simply annotates the focal SNVs and pseudo-SNVs through the SNV annotation pipeline, except for resources that can annotate indels directly (e.g. allele frequencies in human populations). As mentioned above, pseudo-SNVs get “missing data” from SNV-centric annotations but can get annotations from position-centric annotations, such as conservation scores and regulatory segmentations.
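The difference between SNV-centric and position-centric lookups can be illustrated with a toy Python sketch. It does not use any real annotation resource: the two dictionaries stand in for an SNV-centric and a position-centric resource, the function annotate_snvs is hypothetical, and all positions and scores are placeholder values chosen only for illustration.

```python
MISSING = "."  # symbol used for missing data

def annotate_snvs(snvs, snv_centric, position_centric):
    """Look up each SNV in a toy SNV-centric and a toy position-centric resource.

    snvs             -- list of (position, ref_base, alt_base)
    snv_centric      -- dict keyed by (position, ref_base, alt_base)
    position_centric -- dict keyed by position only
    """
    rows = []
    for pos, ref, alt in snvs:
        if ref == alt:
            # a pseudo-SNV has no real substitution, so no SNV-centric score
            snv_score = MISSING
        else:
            snv_score = snv_centric.get((pos, ref, alt), MISSING)
        # position-centric resources only need the coordinate
        pos_score = position_centric.get(pos, MISSING)
        rows.append((snv_score, pos_score))
    return rows


# Placeholder inputs (not real scores).
snvs = [(101, "T", "T"), (102, "G", "G"), (103, "A", "T"), (104, "G", "T")]
toy_snv_centric = {(103, "A", "T"): "0.1", (104, "G", "T"): "0.2"}
toy_position_centric = {101: "1.0", 102: "2.0", 103: "3.0", 104: "4.0"}

rows = annotate_snvs(snvs, toy_snv_centric, toy_position_centric)
for (pos, ref, alt), (snv_score, pos_score) in zip(snvs, rows):
    print(pos, f"{ref}>{alt}", snv_score, pos_score)
```

The two pseudo-SNVs receive “.” from the SNV-centric lookup but still receive a value from the position-centric lookup.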

Step 3: Summarizing annotations of focal SNVs and pseudo-SNVs. The last step summarizes the annotations of the focal SNVs and pseudo-SNVs for each indel. In the current implementation, we simply separate the SNVs’ annotations with “|”, and the order of the annotations follows the order of the focal SNVs and pseudo-SNVs. For example, for the variant shown in sub-figure C, we obtain two CADD phred scores (2.074 and 5.290) for the two focal SNVs and “missing data” for the remaining two pseudo-SNVs, and we summarize them as “.|.|2.074|5.290”, where “.” represents missing data. Note: in the earlier implementation (v0.5-0.76), we counted the number of occurrences of each unique annotation obtained from the SNVs. For the same example, the summary was “.{2}2.074{1}5.290{1}”.
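Both summary formats can be reproduced with a short Python sketch. The function names are hypothetical, and scores are kept as strings so that their formatting (e.g. the trailing zero in 5.290) is preserved. The ordering of entries in the earlier format is an assumption (order of first appearance), chosen because it matches the single example given above.

```python
from collections import Counter

MISSING = "."  # symbol used for missing data

def summarize_current(values):
    """Join per-SNV annotations with "|" in SNV order (current format)."""
    return "|".join(MISSING if v is None else str(v) for v in values)

def summarize_legacy(values):
    """Count each unique annotation (format of v0.5-0.76).

    ASSUMPTION: entries are listed in order of first appearance.
    """
    counts = Counter(MISSING if v is None else str(v) for v in values)
    return "".join(f"{value}{{{n}}}" for value, n in counts.items())


# CADD phred scores from the example: two pseudo-SNVs, then two focal SNVs.
cadd_phred = [None, None, "2.074", "5.290"]
print(summarize_current(cadd_phred))  # .|.|2.074|5.290
print(summarize_legacy(cadd_phred))   # .{2}2.074{1}5.290{1}
```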

References:

Zia, A. & Moses, A. M. Ranking insertion, deletion and nonsense mutations based on their effect on genetic information. BMC Bioinformatics 12, 299 (2011).

Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).