WGS Annotator (WGSA) - an annotation pipeline for human genome re-sequencing studies

WGSA is an annotation pipeline for human genome re-sequencing studies, to facilitate the functional annotation step of whole genome sequencing (WGS). Currently WGSA supports the annotation of SNVs and indels locally without remote database requests, allowing it to scale up for large WGS studies.

For gene-model based annotation, WGSA integrates the outputs from three annotation tools (ANNOVAR, SnpEff and VEP) for RefSeq and Ensembl GENCODE gene models, plus ANNOVAR outputs using the UCSC knownGene model, and provides a summary of variant consequences from the seven annotation results. To further speed up the process for large-scale WGS studies, we have pre-computed annotations for all potential human SNVs (a total of 8,584,031,106 based on human reference hg19 and 8,812,967,043 based on human reference hg38) and use them as a local database (in the resources/precomputed and resources/precomputed_hg38 folders).

For SNV-centric resources, WGSA integrated 14 sets of functional prediction scores (CADD, CDTS, FATHMM-MKL, FATHMM-XF, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127, LINSIGHT, MACIE), 11 conservation scores (bStatistic, GERP++, PhyloP x 5, phastCons x 3, SiPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, ESP6500, ExAC, UK10K, gnomAD), variants in 4 disease related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources). For indel and SV-centric resources, WGSA integrated MetaRNN and StrVCTVRE. For regulatory region-centric resources, WGSA integrated predicted regulatory regions from multiple epigenomics projects (see list of resources).

WGSA also contains rich functional annotations for non-synonymous SNVs and genes from our dbNSFP database, including deleteriousness prediction scores from SIFT, SIFT4G, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, MetaRNN, VEST4, PROVEAN, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, BayesDel_addAF, BayesDel_noAF, ClinPred, LIST-S2 and ALoFT.

WGSA is provided both as an Amazon Machine Image (AMI) ready to run out-of-the-box and as a downloadable version. WGSA and its resources are freely available for academic usage. Licenses are required for non-academic usage of some of the resources, such as ANNOVAR, CADD, CDTS, GenoCanyon, GenoSkyline-Plus, LINSIGHT and SpliceAI, as well as VEST4, Polyphen2, ClinPred and REVEL (in dbNSFP). WGSA does not grant non-academic usage rights for those resources, so please contact the original content provider for that purpose.

You can sign up for our email list for future release announcements.

Please report any bugs you find, or send any suggestions or comments, to xmliu.uth{at}gmail.com or xiaomingliu{at}usf.edu.

Upcoming:

The next version will focus on updating gene models.

Current Version:

WGSA v0.95 AMI

  • WGSA pipeline AMI: WGSAv095 (ami-030460dca5b954dda in AWS US East region - N. Virginia)

  • A guide for using WGSA (example of using WGSA via Amazon Web Service) can be found here.

  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel) can be found here.

WGSA v0.95 downloadable version

  • The downloadable version (total size of ~1.5 TB) is available:

  • A guide for using the downloadable version can be found here.

    • A guide for installing WGSA on an external hard drive attached to a Linux machine can be found here.

  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel, dbNSFP) can be found here.

Utilities for WGSA

  • A collection of utility programs for using WGSA resources and post-processing WGSA annotations can be found here. (update 20190223)

Archives:

Archives of older versions of WGSA can be found here.

Citation:

  1. Liu X*, White S, Peng B, Johnson AD, Brody JA, Li AH, Huang Z, Carroll A, Wei P, Gibbs R, Klein RJ and Boerwinkle E. (2016) WGSA: an annotation pipeline for human genome sequencing studies. Journal of Medical Genetics 53:111-112. [PDF] [preprint] *corresponding author

Changelog:

Update (May 20, 2022): WGSA 0.95 released. This is a minor update focused on the update of resources. Major changes include

  1. StrVCTVRE SV (DEL and DUP) pathogenicity prediction score was added (only supports hg38 and the input file must be in vcf format).

  2. MetaRNN indel pathogenicity prediction score was added.

  3. MACIE pathogenicity prediction score for coding and noncoding SNVs was added.

  4. usDSM pathogenicity prediction score for synonymous SNV was added.

  5. Two PhyloP conservation scores based on 241 mammals and 36 mammals (no human) were added.

  6. CDTS function score based on human genome diversity was added.

  7. Genotypes of Chagyrskaya Neandertal were added.

  8. dbNSFP was updated to 4.3c (4.3a is also supported).

  9. ClinVar was updated to 20220425.

  10. GWAS catalog was updated to e105_r2022-04-07.

  11. dbSNP was updated to b155.

  12. Ensembl Regulatory Build was updated to release104 (hg19) and release106 (hg38).

  13. Default human reference version is now hg38 (use -v hg19 for annotating hg19 based variants).
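As a rough illustration of item 13, the sketch below assembles a command line following the usage string documented in the v0.7 notes further down (`java WGSA07 [setting_file] <-m ...> <-t ...> <-v hg19_or_hg38> <-i vcf_or_tsv>`). The helper name `build_wgsa_command` and the file name `settings.txt` are hypothetical, and the main class name for releases after v0.7 is not stated on this page, so `WGSA07` is only a default placeholder.

```python
def build_wgsa_command(setting_file, memory_gb=None, threads=None,
                       reference=None, input_format=None,
                       main_class="WGSA07"):
    """Build an argument list following the usage string in the v0.7 notes.

    main_class defaults to the v0.7 name; the class name used by later
    releases is an assumption the caller has to supply.
    """
    if reference not in (None, "hg19", "hg38"):
        raise ValueError("reference must be 'hg19' or 'hg38'")
    if input_format not in (None, "vcf", "tsv"):
        raise ValueError("input_format must be 'vcf' or 'tsv'")

    cmd = ["java", main_class, setting_file]
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]   # maximum memory in GB
    if threads is not None:
        cmd += ["-t", str(threads)]     # maximum number of threads
    if reference is not None:
        cmd += ["-v", reference]        # hg19 or hg38
    if input_format is not None:
        cmd += ["-i", input_format]     # vcf or tsv
    return cmd


if __name__ == "__main__":
    # Annotating hg19-based variants requires an explicit "-v hg19".
    print(" ".join(build_wgsa_command("settings.txt", memory_gb=32,
                                      threads=8, reference="hg19",
                                      input_format="vcf")))
```

Returning a list rather than a single string keeps the arguments ready for `subprocess.run` without shell quoting concerns.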

Update (Feb 8, 2021): WGSA 0.9 released. This is a major update focused on the update of gene-model based annotations. Major changes include

  1. VEP is updated to v100.

  2. ANNOVAR is updated to 20200608 version.

  3. All precomputed gene-model based annotations are updated accordingly.

  4. gnomAD genome is updated to v3.1 (only supports hg38). Allele frequencies of the controls, non-Cancer, non-Neuro and non-TOPMed subsets are added.

  5. gnomAD exome is updated to r2.1.1 (only supports hg38). Allele frequencies of the controls, non-Cancer, non-Neuro and non-TOPMed subsets are added.

  6. dbSNP is updated to b154.

  7. dbNSFP is updated to 4.1c (4.1a is also supported).

  8. ClinVar is updated to 20210131.

  9. GWAS catalog is updated to e100_r2021-01-29.

Update (June 11, 2020): WGSA 0.85 released. This is a minor update focused on the update of resources. Major changes include

  1. SpliceAI is added.

  2. gnomAD genome updated to r3.0 (only supports hg38).

  3. CADD updated to v1.6.

  4. GTEx updated to v8 (only supports hg38).

  5. dbNSFP updated to 4.0c (4.0a is also supported).

  6. GWAS catalog updated to e100_r2020-06-04.

  7. ClinVar updated to 20200609.

  8. Ensembl_Regulatory_Build updated to release 100.

  9. Supports VEP 96 and later versions on the fly (tested on VEP 96).

Update (April 19, 2019): WGSA 0.8 released. This is a major update focused on the update of gene-model based annotations. Major changes include

  1. VEP is updated to 94.

  2. SnpEff updated to v4.3t.

  3. All precomputed gene-model based annotations are updated accordingly.

  4. Indel-annotation-via-SNV-annotations is presented in a way that pseudo-SNV annotations have one-to-one correspondence. Please check Indel annotations via SNV annotations.

  5. dbNSFP is updated to 4.0b2.

  6. ClinVar is updated to 20190311.

  7. GWAS catalog is updated to gwas_catalog_v1.0.2-associations_e93_r2019-01-31.

  8. GTEx and Geuvadis eQTLs' targets now use gene symbols instead of Ensembl IDs.

Update (December 12, 2018): WGSA v0.76 released. This is a minor update focused on the update of annotation resources. Please see the guidance before downloading. Major changes include

  1. The default dbNSFP in WGSA is now dbNSFP4.0b1c. WGSA also supports dbNSFP4.0b1a with dbNSFPa_variant option for academic usage of additional deleteriousness prediction scores including Polyphen2, VEST4 and REVEL. See this guidance for using dbNSFP4.0b1a.

  2. ALoFT is no longer an independent resource but is provided via dbNSFP.

  3. The default ANNOVAR program for indel annotation is now version 20180416, which supports Ensembl gene model for hg38. HGVSp presentation for indel is now supported.

  4. As fathmm-XF coding and noncoding scores are comparable, the two scores are now combined into one fathmm-XF score with additional information for its origin (coding or noncoding).

  5. CADD is updated to v1.4.

  6. WGSA's CADD indel support now assumes the results are from CADD v1.4, which supports both hg19 and hg38.

  7. gnomAD is updated to 2.1. The number of alt allele homozygotes and allele frequencies of controls subsets are now included.

  8. ClinVar is updated to 20180930.

  9. dbSNP is updated to b151.

  10. GTEx eQTL is updated to v7.

  11. phyloP placental conservation score for hg38 is updated to 30way.

  12. phastCons placental conservation score for hg38 is updated to 30way.

  13. Added phyloP primate 17way conservation score for hg38.

  14. Added phastCons primate 17way conservation score for hg38.

  15. Added bStatistic for hg38.

  16. GWAS catalog is updated to e93.

  17. Known miRNA database miRdb is updated to 22.

  18. miRNA target database TargetScan is updated to v7.2.

Update (July 2, 2018): WGSA v0.75 released. This is a minor update focused on the addition of annotation resources. Please see the guidance before downloading. Major changes include

  1. While preparing input file, duplicated variants will be shown on screen but no longer removed from the input file.

  2. Added FATHMM-XF score, a whole genome deleteriousness prediction score.

  3. Added predicted regulatory elements (15-state and 25-state models) for 127 cell types from the Roadmap epigenomes.

  4. Added GeneHancer, predicted target genes for enhancers (and promoters).

  5. Added eQTLs from the Geuvadis project.

  6. Added dbNSFP v3.4 collection of deleteriousness prediction scores for missense SNVs.

  7. Added ALoFT, a deleteriousness prediction score for stop-gain SNVs.

  8. Added LINSIGHT, a whole genome function prediction score.

  9. Added bStatistic, a measure of background selection and conservation based on comparative genomics.
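The behavior in item 1 (duplicated variants are reported but kept in the input) can be pictured with a minimal sketch. It assumes variants are identified by a (chromosome, position, ref, alt) tuple; the function name and the message format are hypothetical and do not reproduce WGSA's actual output.

```python
from collections import Counter


def report_duplicates(variants):
    """Print variants that occur more than once; return the input unchanged.

    variants: list of (chrom, pos, ref, alt) tuples (an assumed key).
    Returns (kept_variants, duplicated_keys).
    """
    counts = Counter(variants)
    # Counter keeps first-seen order, so duplicates are listed in input order.
    duplicated = [key for key, n in counts.items() if n > 1]
    for chrom, pos, ref, alt in duplicated:
        print(f"duplicated variant: {chrom}:{pos} {ref}>{alt} "
              f"({counts[(chrom, pos, ref, alt)]} copies)")
    # Nothing is removed: every input row is kept, duplicates included.
    return list(variants), duplicated


if __name__ == "__main__":
    rows = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 100, "A", "G")]
    kept, dups = report_duplicates(rows)
    print(len(kept), "rows kept")
```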

Update (February 15, 2018): WGSA v0.71 released. This is a minor update for supporting gnomAD r2.0.2 and the ANNOVAR version of spidex. Changes include

  1. gnomAD updated to r2.0.2. One column was added to the annotation indicating whether the variant is within a low-complexity region or a segmental duplication region.

  2. The third-party spidex resource file changed to hg19_spidex.txt. Users can download this file from the ANNOVAR website.

Update (August 6, 2017): WGSA v0.7 released. This is a major update focused on hg38 support as well as resource updates. A clean re-download is recommended. Please see the guidance before downloading. Major changes include

  1. WGSA07 adds options to specify whether the input file format is vcf or tsv and whether the variant coordinates are in hg19 or hg38. The full usage is: java WGSA07 [setting_file] <-m maximum_number_of_GB_memory_to_use> <-t maximum_number_of_threads_to_use> <-v hg19_or_hg38> <-i vcf_or_tsv>

  2. Added support for variant coordinates in hg38. When annotating variants with hg38 coordinates, annotation resources native to hg38 are used where available; otherwise, the hg38 coordinates are converted to hg19 coordinates and annotated with resources native to hg19.

  3. Added allele frequencies of the Genome Aggregation Database (gnomAD).

  4. Added GenoSkyline-Plus scores, a tissue-specific deleteriousness prediction score for 127 cell types.

  5. Added topologically associated domains (TADs).

  6. Added Vindija Neanderthal genotypes. Genotypes of Altai Neanderthal and Denisova updated.

  7. Ensembl Regulatory Build updated to Ensembl release 88.

  8. dbSNP updated to b150.

  9. GWAS catalog updated to e88_r2017-05-29.

  10. ClinVar updated to 20170530.

  11. GTEx updated to v6p.

  12. Eigen and EigenPC updated to v1.1.

  13. ORegAnno updated to 2015.12.22.

  14. funseq2 updated to 2.1.6.

  15. dbNSFP updated to 2.9.3.

  16. GenoCanyon updated to 1.0.2.
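The hg38 handling described in this release (use a resource native to hg38 when one exists, otherwise convert the coordinates to hg19 and use the hg19 resource) can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name and the resource names in the usage lines are hypothetical, and the actual coordinate conversion step is not shown.

```python
def plan_annotation(variant_build, resource_builds):
    """Decide which build of a resource to use for a variant.

    variant_build: "hg19" or "hg38", the build of the input coordinates.
    resource_builds: set of builds in which the resource is natively available.
    Returns (build_to_use, needs_coordinate_conversion).
    """
    if variant_build not in ("hg19", "hg38"):
        raise ValueError("variant_build must be 'hg19' or 'hg38'")
    if variant_build in resource_builds:
        # A native resource exists: annotate directly.
        return variant_build, False
    if variant_build == "hg38" and "hg19" in resource_builds:
        # Fall back: convert hg38 coordinates to hg19, then annotate.
        return "hg19", True
    # The notes do not describe any other fallback, so none is assumed here.
    raise LookupError("no usable build of this resource for " + variant_build)


if __name__ == "__main__":
    # Hypothetical resources: one native in both builds, one hg19-only.
    print(plan_annotation("hg38", {"hg19", "hg38"}))  # native hg38
    print(plan_annotation("hg38", {"hg19"}))          # converted to hg19
```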

Update (Sept. 21, 2016): WGSA v0.65 released. This update focused on annotation resources. Major changes include

  1. Eigen and EigenPC scores (Nat. Genet. 48, 214–220) added.

  2. GenoCanyon score (Sci. Rep. 5, 10576) added.

  3. FANTOM5 enhancer target genes, promoter robust set (phase 1+2), enhancer expression (phase 1), enhancer robust set added.

  4. Super Enhancer (Cell 155, 934–947) added.

  5. Genome Mappability Score (GMS) added.

  6. Duke mappability scores averaged over 300bp windows added.

  7. dbSNP updated to build 147.

  8. ClinVar updated to 20160802.

  9. dbNSFP updated to v2.9.1.

  10. FANTOM5 enhancer updated to permissive set (phase 1+2).

  11. Support CADD indel annotation output. See the guidance here.

The following folders under resources need to be updated if you have the downloaded version of WGSA06 on your computer:

  1. The following folders have been added: Eigen, EigenPC, GenoCanyon, GMS, SuperEnhancer.

  2. The files under the following folders have been modified: clinvar, dbNSFP, dbSNP, Duke_Mapability, FANTOM5, javaclass.

A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.

Update (Mar. 3, 2016): WGSA v0.6 released. This update focused on gene model annotations. Major changes include

  1. ANNOVAR annotation results were updated to the Dec. 2015 version. The ANNOVAR program was updated to the Feb. 2016 version, which fixed the multiple-thread bug of the Dec. 2015 version.

  2. snpEff annotation results and the program were updated to version 4.2. The new 'ANN' format annotations were used.

  3. VEP annotation results and the program were updated to version 83. The results of the LoF plugin by LOFTEE are now included.

  4. Precomputed ANNOVAR annotation results for all SNVs with the UCSC knownGene model are added.

  5. Transcript-specific annotations are now included by default in the precomputed ANNOVAR/snpEff/VEP annotation results.

  6. Tool-by-gene-model results can now be selected separately. That is, users can choose whether to include the ANNOVAR/ensembl, ANNOVAR/refseq, ANNOVAR/ucsc, snpEff/ensembl, snpEff/refseq, VEP/ensembl and VEP/refseq results individually. Accordingly, the format of the configuration file has changed. Please refer to the new format here.

  7. Users can choose whether to obtain gene-model based annotations (i.e. ANNOVAR/snpEff/VEP x ensembl/refseq/ucsc) of SNVs from precomputed results (recommended for large data sets) or by running the annotation tools on the fly (faster for small data sets).

  8. Users can choose a working directory to store the intermediate files. All intermediate files are gzipped by default to save disk space.

  9. Gzipped input files are now supported.

  10. Resources of GTEx eQTLs, Roadmap epigenomics peak calls (narrowPeaks for >1000 epigenomics data sets), and allele frequencies of the ExAC r0.3 nonTCGA and nonpsych subsets are added.

  11. Users need to download their own copy of the COSMIC resource for annotation due to license requirements. Please see the guidance here.
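To make the separate selection in item 6 concrete, here is a minimal sketch that enumerates the seven tool/gene-model combinations named there and filters them by a user's choice. It does not reproduce the WGSA configuration file format; the function name and the way the choice is passed in are assumptions for illustration.

```python
# The seven tool/gene-model combinations listed in item 6.
ALL_TRACKS = (
    "ANNOVAR/ensembl", "ANNOVAR/refseq", "ANNOVAR/ucsc",
    "snpEff/ensembl", "snpEff/refseq",
    "VEP/ensembl", "VEP/refseq",
)


def select_tracks(enabled):
    """Return the enabled combinations in canonical order.

    enabled: iterable of names taken from ALL_TRACKS.
    Raises ValueError on names that are not one of the seven combinations.
    """
    enabled = set(enabled)
    unknown = enabled - set(ALL_TRACKS)
    if unknown:
        raise ValueError("unknown combination(s): " + ", ".join(sorted(unknown)))
    return [track for track in ALL_TRACKS if track in enabled]


if __name__ == "__main__":
    # Example: keep only the Ensembl-based results from VEP and snpEff.
    print(select_tracks(["VEP/ensembl", "snpEff/ensembl"]))
```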

The following folders under resources need to be updated if you have the downloaded version of WGSA055 on your computer:

  1. The contents under the following folders have been deleted: COSMIC, IntegratedSNV, FAMTOM5.

  2. The following folders have been added: GTEx, precomputed, Roadmap_peaks, FANTOM5.

  3. The files under the following folders have been modified: 1000Gmask, clinvar, dbSNP, ENCODE, EnhancerFinder, Ensembl_regulatory_build, ESP6500, ExACr0.3, GRASP, GWAS_catalog, hg19, human_ancestor_GRCh37_e71, javaclass, ORegAnno, repeatmasker, scSNV, snoRNA_miRNA.

A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.

Update (Oct. 6, 2015): WGSA v0.55 released. This update focused on annotation resources. Major changes include

  1. DANN, fitCons and EnhancerFinder resources added;

  2. Support annotation using the SPIDEX free non-commercial version but independent license/download needed.

  3. Genome-wide ranks added to CADD, DANN, FATHMM-MKL, fitCons, funseq-2, GERP++, phastCons, phyloP, SiPhy;

  4. CADD, dbSNP, snoRNA/miRNA, miRNA targets, ClinVar updated;

  5. Multiple alt alleles of indels in dbSNP, ExAC, ESP6500 and 1000Gp3 have been separated and left-normalized;

  6. Annotation results using standard variant list file as input will retain all columns of the input file;

  7. Allows options to turn off integrated annotations of ANNOVAR/SnpEff/VEP x Refseq/Ensembl;

  8. Bugs fixed with repeat mask and 1000g mask annotation;

  9. In case multiple rows in dbNSFP match the variant, those rows are combined into a single row.
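Item 5 mentions separating multiple alt alleles and left-normalizing indels. The sketch below shows one common formulation of that idea: split a multi-allelic record into one record per alt allele, then repeatedly trim shared trailing bases, extend to the left with the preceding reference base whenever an allele becomes empty, and finally trim shared leading bases. Whether WGSA follows exactly this procedure is not stated here; the function names and the toy reference sequence are hypothetical.

```python
def split_multiallelic(pos, ref, alts):
    """Split a comma-separated alt field into one (pos, ref, alt) per allele."""
    return [(pos, ref, alt) for alt in alts.split(",")]


def left_normalize(pos, ref, alt, ref_seq):
    """Left-align and trim one variant.

    pos is 1-based; ref_seq is the reference sequence of the chromosome,
    with ref_seq[0] corresponding to position 1.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical")
    changed = True
    while changed:
        changed = False
        # Step 1: drop a shared rightmost base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Step 2: if an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of position 1")
            pos -= 1
            base = ref_seq[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True
    # Step 3: drop shared leading bases while both alleles keep one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy_reference = "GGGCACACACAGGG"  # toy sequence with a CA repeat
    for record in split_multiallelic(7, "ACA", "A,ACACA"):
        print(record, "->", left_normalize(*record, toy_reference))
```

In this toy case both the deletion and the insertion of one CA unit end up anchored at the G immediately before the repeat, so equivalent representations of the same indel collapse to a single record that can be matched across resources.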

The following folders under resources need to be updated if you have the downloaded version of WGSA05 on your computer: 1000Gp3, CADDv1.3, DANN, ESP6500, EnhancerFinder, ExACr0.3, GERP, PhyloP, SiPhy, clinvar, dbSNP, fathmmMKL, fitConsv1.01, funseq2, javaclass, phastCons, snoRNA_miRNA. A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.