WGS Annotator (WGSA) - an annotation pipeline for human genome re-sequencing studies

WGSA is an annotation pipeline for human genome re-sequencing studies, to facilitate the functional annotation step of whole genome sequencing (WGS). Currently WGSA supports the annotation of SNVs and indels locally without remote database requests, allowing it to scale up for large WGS studies. 

For gene-model based annotation, WGSA integrates the outputs of three annotation tools (ANNOVAR, SnpEff and VEP) for the RefSeq and Ensembl GENCODE gene models, plus ANNOVAR output for the UCSC knownGene model, and provides a summary of variant consequences across the seven annotation results. To further speed up the process for large-scale WGS studies, we have pre-computed annotations for all potential human SNVs (a total of 8,584,031,106 based on human reference hg19 and 8,812,967,043 based on human reference hg38) and use them as a local database (in the resources/precomputed and resources/precomputed_hg38 folders). For SNV-centric resources, WGSA integrates 12 sets of functional prediction scores (CADD, FATHMM-MKL, FATHMM-XF, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127, LINSIGHT), 9 conservation scores (bStatistic, GERP++, PhyloP x 3, phastCons x 3, SiPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, ESP6500, ExAC, UK10K, gnomAD), and variants in 4 disease-related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources). For regulatory region-centric resources, WGSA integrates predicted regulatory regions from multiple epigenomics projects (see list of resources). WGSA also contains rich functional annotations for non-synonymous SNVs and genes from our dbNSFP database, including deleteriousness prediction scores from SIFT, SIFT4G, Polyphen2, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST4, PROVEAN, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2 and ALoFT.

WGSA is provided both as an Amazon Machine Image (AMI) ready to run out-of-the-box and as a downloadable version. WGSA and its resources are freely available for academic usage. Licenses are required for non-academic usage of some of the resources, such as ANNOVAR, CADD, GenoCanyon, GenoSkyline-Plus, LINSIGHT, and VEST4, Polyphen2 and REVEL (in dbNSFP). WGSA does not grant non-academic usage of those resources, so please contact the original content providers for that purpose.

You can sign up for our email list for future release announcements.
Please report any bugs you find, or send any suggestions or comments, to xmliu.uth{at}gmail.com or xiaoming.liu{at}uth.tmc.edu.


Upcoming:

WGSA v0.8 will support new versions of ANNOVAR, SnpEff and VEP.



Current Version:

WGSA v0.76 AMI
  • WGSA pipeline AMI: WGSA076 (ami-02827ef1857a4740f in AWS US East region - N. Virginia)
  • Guidance for using WGSA via Amazon Web Services can be found here
  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel) can be found here
WGSA v0.76 downloadable version
  • The downloadable version (total size of ~1.9 TB) is hosted at TACC
    • Because the resources are very large, I strongly recommend using the AMI or downloading only the resources you plan to use. Please read this guidance for downloading.
  • Guidance for using the downloadable version can be found here
    • Guidance for installing WGSA on an external hard drive attached to a Linux machine can be found here
  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel, dbNSFP) can be found here
Utilities for WGSA
  • A collection of utility programs for using WGSA resources and post-processing WGSA annotations can be found here. (update 20170811)

Archives:

Archives of older versions of WGSA can be found here.  


Citation:
  1. Liu X*, White S, Peng B, Johnson AD, Brody JA, Li AH, Huang Z, Carroll A, Wei P, Gibbs R, Klein RJ and Boerwinkle E. (2016) WGSA: an annotation pipeline for human genome sequencing studies. Journal of Medical Genetics 53:111-112. [PDF] [preprint] *corresponding author

Changelog:

Update (December 12, 2018): WGSA v0.76 released. This is a minor update focused on the update of annotation resources. Please see the guidance before downloading. Major changes include
  1. The default dbNSFP in WGSA is now dbNSFP4.0b1c. WGSA also supports dbNSFP4.0b1a with dbNSFPa_variant option for academic usage of additional deleteriousness prediction scores including Polyphen2, VEST4 and REVEL. See this guidance for using dbNSFP4.0b1a.
  2. Aloft is no longer an independent resource but provided via dbNSFP.
  3. The default ANNOVAR program for indel annotation is now version 20180416, which supports Ensembl gene model for hg38. HGVSp presentation for indel is now supported.
  4. As fathmm-XF coding and noncoding scores are comparable, the two scores are now combined into one fathmm-XF score with additional information for its origin (coding or noncoding).
  5. CADD is updated to v1.4. 
  6. WGSA's CADD indel support now assumes the results are from CADD v1.4, which supports both hg19 and hg38.
  7. gnomAD is updated to 2.1. The number of alt allele homozygotes and allele frequencies of controls subsets are now included. 
  8. clinvar is updated to 20180930.
  9. dbSNP is updated to b151.
  10. GTEx eQTL is updated to v7.
  11. phyloP placental conservation score for hg38 is updated to 30way.
  12. phastCons placental conservation score for hg38 is updated to 30way.
  13. Added phyloP primate 17way conservation score for hg38. 
  14. Added phastCons primate 17way conservation score for hg38. 
  15. Added bStatistic for hg38.
  16. GWAS catalog is updated to e93.
  17. Known miRNA database miRBase is updated to 22.
  18. miRNA target database TargetScan is updated to v7.2.

Update (July 2, 2018): WGSA v0.75 released. This is a minor update focused on the addition of annotation resources. Please see the guidance before downloading. Major changes include
  1. While preparing the input file, duplicated variants are reported on screen but are no longer removed from the input file.
  2. Added FATHMM-XF score, a whole genome deleteriousness prediction score.
  3. Added predicted regulatory elements (15-state and 25-state models) for 127 cell types from the Roadmap epigenomes.
  4. Added GeneHancer, predicted target genes for enhancers (and promoters). 
  5. Added eQTLs from the Geuvadis project.
  6. Added dbNSFP v3.4 collection of deleteriousness prediction scores for missense SNVs
  7. Added Aloft, a deleteriousness prediction score for stop-gain SNVs.
  8. Added LINSIGHT, a whole genome function prediction score.
  9. Added bStatistic, a measure of background selection and conservation based on comparative genomics.

Update (February 15, 2018): WGSA v0.71 released. This is a minor update for supporting gnomAD r2.0.2 and the ANNOVAR version of spidex. Changes include
  1. gnomAD updated to r2.0.2. One column was added to the annotation indicating whether the variant is within a low-complexity region or a segmental duplication region.
  2. The third-party spidex resource file changed to hg19_spidex.txt. Users can download this file from the ANNOVAR website.

Update (August 6, 2017): WGSA v0.7 released. This is a major update focused on hg38 support as well as resource updates. A clean re-download is recommended. Please see the guidance before downloading. Major changes include
  1. WGSA07 adds options to specify whether the input file format is vcf or tsv and whether the variant coordinates are in hg19 or hg38. The full usage is java WGSA07 [setting_file] <-m maximum_number_of_GB_memory_to_use> <-t maximum_number_of_threads_to_use> <-v hg19_or_hg38> <-i vcf_or_tsv>
  2. Added support for variant coordinates in hg38. When annotating variants with hg38 coordinates, annotation resources native to hg38 are used where available; otherwise, the hg38 coordinates are converted to hg19 and the variants are annotated with resources native to hg19.
  3. Added allele frequencies of the Genome Aggregation Database (gnomAD)
  4. Added GenoSkyline-Plus scores, a tissue-specific deleteriousness prediction score for 127 cell types.
  5. Added topologically associated domains (TADs)
  6. Added Vindija Neanderthal genotypes. Genotypes of Altai Neanderthal and Denisova updated.
  7. Ensembl Regulatory Build updated to Ensembl release 88
  8. dbSNP updated to b150
  9. GWAS catalog updated to e88_r2017-05-29
  10. clinvar updated to 20170530
  11. GTEx updated to v6p
  12. Eigen and EigenPC updated to v1.1
  13. ORegAnno updated to 2015.12.22
  14. funseq2 updated to 2.1.6
  15. dbNSFP updated to 2.9.3
  16. GenoCanyon updated to 1.0.2
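The command-line usage given in item 1 above can be assembled programmatically. The following Python sketch only builds and prints the argument list; it does not run WGSA. The settings file name my_settings.txt and the memory and thread values are hypothetical, and the sketch assumes the WGSA07 class can be launched from the current directory exactly as in the usage line.

```python
import shlex

def build_wgsa_command(setting_file, memory_gb=None, threads=None,
                       build=None, input_format=None):
    """Assemble the `java WGSA07 ...` argument list from the documented usage."""
    # Only the values named in the usage line are accepted for -v and -i.
    if build is not None and build not in ("hg19", "hg38"):
        raise ValueError("build must be 'hg19' or 'hg38'")
    if input_format is not None and input_format not in ("vcf", "tsv"):
        raise ValueError("input_format must be 'vcf' or 'tsv'")

    cmd = ["java", "WGSA07", setting_file]
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]   # maximum memory to use, in GB
    if threads is not None:
        cmd += ["-t", str(threads)]     # maximum number of threads to use
    if build is not None:
        cmd += ["-v", build]            # coordinate system of the variants
    if input_format is not None:
        cmd += ["-i", input_format]     # input file format
    return cmd

# Hypothetical example: hg38 VCF input, 16 GB of memory, 4 threads.
cmd = build_wgsa_command("my_settings.txt", memory_gb=16, threads=4,
                         build="hg38", input_format="vcf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the settings file and resources are in place; optional arguments that are left out are simply omitted from the command.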

Update (Sept. 21, 2016): WGSA v0.65 released. This update focused on annotation resources. Major changes include
  1. Eigen and EigenPC scores (Nat. Genet. 48, 214–220) added.
  2. GenoCanyon score (Sci. Rep. 5, 10576) added.
  3. FANTOM5 enhancer target genes, promoter robust set (phase 1+2), enhancer expression (phase 1), enhancer robust set added.
  4. Super Enhancer (Cell 155, 934–947) added.
  5. Genome Mappability Score (GMS) added.
  6. Duke mappability scores averaged over 300bp windows added.
  7. dbSNP updated to build 147.
  8. ClinVar updated to 20160802.
  9. dbNSFP updated to v2.9.1.
  10. FANTOM5 enhancer updated to permissive set (phase 1+2).
  11. Support CADD indel annotation output. See the guidance here
The following folders under resources need to be updated if you have the downloaded version of WGSA06 on your computer:  
  1. The following folders have been added: Eigen, EigenPC, GenoCanyon, GMS, SuperEnhancer. 
  2. The files under the following folders have been modified: clinvar, dbNSFP, dbSNP, Duke_Mapability, FANTOM5, javaclass.
 A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.


Update (Mar. 3, 2016): WGSA v0.6 released. This update focused on gene model annotations. Major changes include
  1. ANNOVAR annotation results were updated to the Dec. 2015 version. The ANNOVAR program was updated to the Feb. 2016 version, which fixed the multi-threading bug of the Dec. 2015 version.
  2. snpEff annotation results and the program were updated to version 4.2. The new 'ANN' format annotations are used.
  3. VEP annotation results and the program were updated to version 83. The results of the LoF plugin by LOFTEE are now included.
  4. Precomputed ANNOVAR annotation results for all SNVs with the UCSC knownGene model were added.
  5. Transcript-specific annotations are now included by default in the precomputed ANNOVAR/snpEff/VEP annotation results.
  6. Tool-by-gene-model results can now be selected separately. That is, users can choose whether to include each of the ANNOVAR/ensembl, ANNOVAR/refseq, ANNOVAR/ucsc, snpEff/ensembl, snpEff/refseq, VEP/ensembl and VEP/refseq results. Accordingly, the format of the configuration file has changed. Please refer to the new format here
  7. Users can choose whether to get gene model based annotations (i.e. ANNOVAR/snpEff/VEP x ensembl/refseq/ucsc) of SNVs from the precomputed results (recommended for large data sets) or by running the annotation tools on-the-fly (faster for small data sets).
  8. Users can choose a working directory to store the intermediate files. All intermediate files are gzipped by default to save disk space.
  9. Gzipped input files are now supported.
  10. Resources of GTEx eQTLs, Roadmap epigenomics peak calls (narrowPeaks for >1000 epigenomics data sets), and allele frequencies of the ExAC r0.3 nonTCGA and nonpsych subsets are added. 
  11. Users need to download their own copy of the COSMIC resource for annotation due to license requirements. Please see the guidance here
The following folders under resources need to be updated if you have the downloaded version of WGSA055 on your computer:  
  1. The contents under the following folders have been deleted: COSMIC, IntegratedSNV, FAMTOM5 .
  2. The following folders have been added: GTEx, precomputed, Roadmap_peaks, FANTOM5.
  3. The files under the following folders have been modified: 1000Gmask, clinvar, dbSNP, ENCODE, EnhancerFinder, Ensembl_regulatory_build, ESP6500, ExACr0.3, GRASP, GWAS_catalog, hg19, human_ancestor_GRCh37_e71, javaclass, ORegAnno, repeatmasker, scSNV, snoRNA_miRNA.
      A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.


Update (Oct. 6, 2015): WGSA v0.55 released. This update focused on annotation resources. Major changes include 
  1. DANN, fitCons and EnhancerFinder resources added; 
  2. Added support for annotation using the free non-commercial version of SPIDEX; an independent license and download are needed.
  3. Genome-wide ranks added to CADD, DANN, FATHMM-MKL, fitCons, funseq-2, GERP++, phastCons, phyloP, SiPhy; 
  4. CADD, dbSNP, snoRNA/miRNA, miRNA targets, clinvar updated; 
  5. Multiple alt alleles of indels in dbSNP, ExAC, ESP6500 and 1000Gp3 have been separated and left-normalized; 
  6. Annotation results using standard variant list file as input will retain all columns of the input file; 
  7. Allows options to turn off integrated annotations of ANNOVAR/SnpEff/VEP x Refseq/Ensembl; 
  8. Bugs fixed with repeat mask and 1000g mask annotation;
  9. When multiple rows in dbNSFP match a variant, those rows are combined into a single row.
    The following folders under resources need to be updated if you have the downloaded version of WGSA05 on your computer:  1000Gp3,  CADDv1.3,  DANN,  ESP6500,  EnhancerFinder,  ExACr0.3,  GERP,  PhyloP,  SiPhy,  clinvar,  dbSNP,  fathmmMKL,  fitConsv1.01,  funseq2,  javaclass,  phastCons, snoRNA_miRNA. A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.
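
Item 5 of the v0.55 changes mentions that multiple alt alleles of indels were separated and left-normalized. The release notes do not say which procedure WGSA uses, so the Python sketch below only illustrates the commonly used trim-and-extend approach: each alt allele is treated as its own variant, shared trailing bases are trimmed, the alleles are extended to the left with reference bases whenever one becomes empty, and redundant leading bases are trimmed last. The function names and the toy reference sequence are hypothetical.

```python
def left_normalize(pos, ref, alt, chrom_seq):
    """Left-align and trim one ref/alt pair.

    pos is 1-based; chrom_seq is the reference sequence of the chromosome.
    A variant that reaches position 1 with an empty allele is left as-is
    (extending to the right is not handled in this sketch).
    """
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = chrom_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_and_normalize(pos, ref, alts, chrom_seq):
    """Separate a multi-allelic record into one normalized variant per alt allele."""
    return [left_normalize(pos, ref, alt, chrom_seq) for alt in alts]


# Toy reference containing a CA repeat (hypothetical data).
REFERENCE = "GGGCACACACAGGG"

# One record at position 7 with a CA deletion and a CA insertion.
variants = split_and_normalize(7, "ACA", ["A", "ACACA"], REFERENCE)
print(variants)  # [(3, 'GCA', 'G'), (3, 'G', 'GCA')]
```

Both alleles end up anchored at the left edge of the repeat, so the same indel reported at different positions within the repeat by different resources maps to a single representation.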