WGS Annotator (WGSA) - an annotation pipeline for human genome re-sequencing studies

WGSA is an annotation pipeline for human genome re-sequencing studies, to facilitate the functional annotation step of whole genome sequencing (WGS). Currently WGSA supports the annotation of SNVs and indels locally without remote database requests, allowing it to scale up for large WGS studies. 

For gene-model based annotation, WGSA integrates the outputs from three annotation tools (ANNOVAR, SnpEff and VEP) for RefSeq and Ensembl GENCODE gene models, plus ANNOVAR outputs using the UCSC knownGene model, and provides a summary of variant consequences from the seven annotation results. To further speed up the process for large-scale WGS studies, we have pre-computed annotations for all potential human SNVs (a total of 8,584,031,106 based on human reference hg19 and 8,812,967,043 based on human reference hg38) and use them as a local database (in the resources/precomputed and resources/precomputed_hg38 folders). For SNV-centric resources, WGSA integrates 10 sets of functional prediction scores (CADD, FATHMM-MKL, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127), 8 conservation scores (GERP++, PhyloP x 3, phastCons x 3, SiPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, ESP6500, ExAC, UK10K, gnomAD), variants in 4 disease related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources). For regulatory region-centric resources, WGSA integrates cell type specific transcription factor binding sites, DNase I hypersensitivity regions and chromosome activity predictions from multiple epigenomics projects (see list of resources). WGSA also contains rich functional annotations for non-synonymous SNVs and genes from our dbNSFP database.
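Both SNV totals quoted above are divisible by three, which is consistent with counting three possible alternate alleles at every unambiguous (A/C/G/T) reference position. The Python sketch below illustrates that counting on a toy sequence. It is only an illustration under that assumption: the function names and the toy sequence are hypothetical, and this is not WGSA's own code.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def potential_snvs(chrom: str, seq: str) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, 1-based position, ref, alt) for every possible SNV in seq."""
    for pos, ref in enumerate(seq.upper(), start=1):
        if ref not in BASES:  # skip N and other ambiguity codes
            continue
        for alt in BASES:
            if alt != ref:
                yield (chrom, pos, ref, alt)


def count_potential_snvs(seq: str) -> int:
    """Three alternate alleles per unambiguous reference base."""
    return 3 * sum(1 for base in seq.upper() if base in BASES)


# Toy (hypothetical) sequence: four unambiguous bases and two Ns.
toy = "ACNNGT"
snvs = list(potential_snvs("chrToy", toy))
print(len(snvs), count_potential_snvs(toy))  # 12 12

# The totals quoted in the text leave no remainder when divided by 3.
remainders = [total % 3 for total in (8_584_031_106, 8_812_967_043)]
print(remainders)  # [0, 0]
```

Enumerating every (position, ref, alt) triple once is what makes a precomputed lookup possible: annotating a study's SNVs then becomes a matter of finding matching rows rather than running the annotation tools again.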

WGSA is provided both as an Amazon Machine Image (AMI) ready to run out-of-the-box and as a downloadable version. WGSA and its resources are freely available for academic usage. Licenses are required for non-academic usage of some of the resources, such as ANNOVAR, Polyphen2, CADD, DANN and VEST3 (in dbNSFP). WGSA does not grant non-academic usage rights to those resources, so please contact the original content provider for that purpose.

You can sign up for our email list for future release announcements.
Please report any bugs you find, or send any suggestions or comments, to xmliu.uth{at}gmail.com or xiaoming.liu{at}uth.tmc.edu.


Upcoming:

WGSA v0.75 will support new versions of ANNOVAR, SnpEff and VEP.



Current Version:

WGSA v0.7 AMI
  • WGSA pipeline AMI: WGSA07 (ami-6b002c10 in N. Virginia)
  • Guidance for using WGSA via Amazon Web Services can be found here
  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel) can be found here
WGSA v0.7 downloadable version
  • The downloadable version (total size of ~1.4 TB) is hosted at TACC
    • Because the resources are large, I strongly recommend using the AMI or downloading only the resources you plan to use. Please read this guidance for downloading.
  • Guidance for using the downloadable version can be found here
    • Guidance for installing WGSA on an external hard drive attached to a Linux machine can be found here
  • Guidance for using external resources (COSMIC, SPIDEX, CADD indel) can be found here
Utilities for WGSA
  • A collection of utility programs for using WGSA resources and post-processing WGSA annotations can be found here. (update 20170811)

Archives:

Archives of older versions of WGSA can be found here.  


Citation:
  1. Liu X*, White S, Peng B, Johnson AD, Brody JA, Li AH, Huang Z, Carroll A, Wei P, Gibbs R, Klein RJ and Boerwinkle E. (2016) WGSA: an annotation pipeline for human genome sequencing studies. Journal of Medical Genetics 53:111-112. [PDF] [preprint] *corresponding author

Changelog:

Update (August 6, 2017): WGSA v0.7 released. This is a major update focused on hg38 support as well as resource updates. A clean re-download is recommended. Please see the guidance before downloading. Major changes include
  1. WGSA07 adds options to specify whether the input file format is vcf or tsv and whether the variant coordinates are in hg19 or hg38. The full usage is: java WGSA07 [setting_file] <-m maximum_number_of_GB_memory_to_use> <-t maximum_number_of_threads_to_use> <-v hg19_or_hg38> <-i vcf_or_tsv>
  2. Added support for variant coordinates in hg38. When annotating variants with hg38 coordinates, annotation resources native to hg38 are used where available; otherwise, the hg38 coordinates are converted to hg19 and the variants are annotated with resources native to hg19.
  3. Added allele frequencies of the Genome Aggregation Database (gnomAD)
  4. Added GenoSkyline-Plus scores, a tissue-specific deleteriousness prediction score for 127 cell types.
  5. Added topologically associated domains (TADs)
  6. Added Vindija Neanderthal genotypes. Genotypes of Altai Neanderthal and Denisova updated.
  7. Ensembl Regulatory Build updated to Ensembl release 88
  8. dbSNP updated to b150
  9. GWAS catalog updated to e88_r2017-05-29
  10. ClinVar updated to 20170530
  11. GTEx updated to v6p
  12. Eigen and EigenPC updated to v1.1
  13. ORegAnno updated to 2015.12.22
  14. funseq2 updated to 2.1.6
  15. dbNSFP updated to 2.9.3
  16. GenoCanyon updated to 1.0.2
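The usage line in item 1 can be assembled programmatically when WGSA07 is launched from a script. The Python sketch below only builds the argument list. It assumes that the four flags are optional (as the angle brackets in the usage line suggest) and that the Java classpath is already configured. The helper name and the setting file name my_settings.txt are hypothetical.

```python
from typing import List, Optional


def build_wgsa07_command(
    setting_file: str,
    max_memory_gb: Optional[int] = None,
    max_threads: Optional[int] = None,
    genome_build: Optional[str] = None,
    input_format: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `java WGSA07 ...` from the usage line."""
    if genome_build not in (None, "hg19", "hg38"):
        raise ValueError("genome_build must be 'hg19' or 'hg38'")
    if input_format not in (None, "vcf", "tsv"):
        raise ValueError("input_format must be 'vcf' or 'tsv'")

    cmd = ["java", "WGSA07", setting_file]
    if max_memory_gb is not None:
        cmd += ["-m", str(max_memory_gb)]  # maximum number of GB of memory
    if max_threads is not None:
        cmd += ["-t", str(max_threads)]    # maximum number of threads
    if genome_build is not None:
        cmd += ["-v", genome_build]        # coordinates in hg19 or hg38
    if input_format is not None:
        cmd += ["-i", input_format]        # input file format, vcf or tsv
    return cmd


# "my_settings.txt" is a placeholder, not a file shipped with WGSA.
cmd = build_wgsa07_command(
    "my_settings.txt",
    max_memory_gb=16,
    max_threads=4,
    genome_build="hg38",
    input_format="vcf",
)
print(" ".join(cmd))
# java WGSA07 my_settings.txt -m 16 -t 4 -v hg38 -i vcf
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory where WGSA07 is installed. Validating the -v and -i values up front catches a mistyped build name before a long annotation run starts.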

Update (Sept. 21, 2016): WGSA v0.65 released. This update focused on annotation resources. Major changes include
  1. Eigen and EigenPC scores (Nat. Genet. 48, 214–220) added.
  2. GenoCanyon score (Sci. Rep. 5, 10576) added.
  3. FANTOM5 enhancer target genes, promoter robust set (phase 1+2), enhancer expression (phase 1), enhancer robust set added.
  4. Super Enhancer (Cell 155, 934–947) added.
  5. Genome Mappability Score (GMS) added.
  6. Duke mappability scores averaged over 300bp windows added.
  7. dbSNP updated to build 147.
  8. ClinVar updated to 20160802.
  9. dbNSFP updated to v2.9.1.
  10. FANTOM5 enhancer updated to permissive set (phase 1+2).
  11. Support CADD indel annotation output. See the guidance here
The following folders under resources need to be updated if you have the downloaded version of WGSA06 on your computer:  
  1. The following folders have been added: Eigen, EigenPC, GenoCanyon, GMS, SuperEnhancer. 
  2. The files under the following folders have been modified: clinvar, dbNSFP, dbSNP, Duke_Mapability, FANTOM5, javaclass.
 A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.
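The deletion half of a clean update can be scripted. The Python sketch below empties the folders named in the v0.65 list of modified folders and defaults to a dry run that only reports what would be removed. The helper name is hypothetical, the demonstration runs against a temporary directory rather than a real resources folder, and re-downloading is left to the download guidance.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List

# Folder names copied from the v0.65 list of modified folders.
MODIFIED_FOLDERS = ["clinvar", "dbNSFP", "dbSNP", "Duke_Mapability", "FANTOM5", "javaclass"]


def clean_folders(resources_dir: Path, folders: Iterable[str], dry_run: bool = True) -> List[Path]:
    """Empty each listed folder under resources_dir and return the affected paths.

    Folders that do not exist are skipped. With dry_run=True nothing is deleted.
    """
    affected: List[Path] = []
    for name in folders:
        folder = resources_dir / name
        if not folder.is_dir():
            continue
        for entry in sorted(folder.iterdir()):
            affected.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return affected


# Demonstration on a throwaway directory standing in for "resources".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clinvar").mkdir()
    (root / "clinvar" / "old.txt.gz").write_text("stale")
    planned = [p.name for p in clean_folders(root, MODIFIED_FOLDERS)]
    removed = [p.name for p in clean_folders(root, MODIFIED_FOLDERS, dry_run=False)]
    remaining = list((root / "clinvar").iterdir())

print(planned, removed, remaining)  # ['old.txt.gz'] ['old.txt.gz'] []
```

Running the dry run first and reading the returned paths is a cheap safeguard, since the resource folders are large and slow to re-download.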


Update (Mar. 3, 2016): WGSA v0.6 released. This update focused on gene model annotations. Major changes include
  1. ANNOVAR annotation results were updated to the Dec. 2015 version. The ANNOVAR program was updated to the Feb. 2016 version, which fixed the multi-threading bug of the Dec. 2015 version.
  2. snpEff annotation results and the program were updated to version 4.2. The new 'ANN' format annotations are used.
  3. VEP annotation results and the program were updated to version 83. The results of the LoF plugin by LOFTEE are now included.
  4. Precomputed ANNOVAR annotation results for all SNVs with the UCSC knownGene model were added.
  5. Transcript-specific annotations are now included by default in the precomputed ANNOVAR/snpEff/VEP annotation results.
  6. Tool-by-gene-model results can now be selected separately. That is, users can choose whether to include ANNOVAR/ensembl, ANNOVAR/refseq, ANNOVAR/ucsc, snpEff/ensembl, snpEff/refseq, VEP/ensembl, VEP/refseq results separately. Accordingly, the format of the configuration file has changed. Please refer to the new format here
  7. Users can choose whether to get gene model based annotations (i.e. ANNOVAR/snpEff/VEP x ensembl/refseq/ucsc) of SNVs using precomputed results (recommended for large data sets) or using the annotation tools on-the-fly (faster for small data sets).
  8. Users can choose a working directory to store the intermediate files. All intermediate files are gzipped by default to save disk space.
  9. Gzipped input files are now supported.
  10. Resources of GTEx eQTLs, Roadmap epigenomics peak calls (narrowPeaks for >1000 epigenomics data sets), and allele frequencies of the ExAC r0.3 nonTCGA and nonpsych subsets are added. 
  11. Users need to download their own copy of the COSMIC resource for annotation due to license requirements. Please see the guidance here
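The seven tool/gene-model combinations named in item 6 are the three tools crossed with the two shared gene models, plus ANNOVAR with the UCSC model. The Python sketch below enumerates them and validates a user's selection. The function names and the dictionary it returns are hypothetical and do not reflect the actual WGSA configuration file format.

```python
from itertools import product
from typing import Dict, Iterable, List

TOOLS = ("ANNOVAR", "snpEff", "VEP")
SHARED_MODELS = ("ensembl", "refseq")


def all_combinations() -> List[str]:
    """Return the seven tool/gene-model names listed in item 6."""
    combos = [f"{tool}/{model}" for tool, model in product(TOOLS, SHARED_MODELS)]
    combos.append("ANNOVAR/ucsc")  # only ANNOVAR is run with the UCSC model
    return sorted(combos)


def select(requested: Iterable[str]) -> Dict[str, bool]:
    """Map every combination to True/False; reject names that do not exist."""
    valid = all_combinations()
    wanted = set(requested)
    unknown = wanted - set(valid)
    if unknown:
        raise ValueError(f"unknown combination(s): {sorted(unknown)}")
    return {name: name in wanted for name in valid}


print(len(all_combinations()))  # 7
selection = select(["VEP/ensembl", "ANNOVAR/ucsc"])
print([name for name, on in selection.items() if on])
# ['ANNOVAR/ucsc', 'VEP/ensembl']
```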
The following folders under resources need to be updated if you have the downloaded version of WGSA055 on your computer:  
  1. The contents under the following folders have been deleted: COSMIC, IntegratedSNV, FAMTOM5.
  2. The following folders have been added: GTEx, precomputed, Roadmap_peaks, FANTOM5.
  3. The files under the following folders have been modified: 1000Gmask, clinvar, dbSNP, ENCODE, EnhancerFinder, Ensembl_regulatory_build, ESP6500, ExACr0.3, GRASP, GWAS_catalog, hg19, human_ancestor_GRCh37_e71, javaclass, ORegAnno, repeatmasker, scSNV, snoRNA_miRNA.
      A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.


Update (Oct. 6, 2015): WGSA v0.55 released. This update focused on annotation resources. Major changes include 
  1. DANN, fitCons and EnhancerFinder resources added.
  2. Annotation using the free non-commercial version of SPIDEX is supported, but a separate license/download is needed.
  3. Genome-wide ranks added to CADD, DANN, FATHMM-MKL, fitCons, funseq-2, GERP++, phastCons, phyloP, SiPhy.
  4. CADD, dbSNP, snoRNA/miRNA, miRNA targets, clinvar updated.
  5. Multiple alt alleles of indels in dbSNP, ExAC, ESP6500 and 1000Gp3 have been separated and left-normalized.
  6. Annotation results using a standard variant list file as input retain all columns of the input file.
  7. Options added to turn off integrated annotations of ANNOVAR/SnpEff/VEP x Refseq/Ensembl.
  8. Bugs fixed in the repeat mask and 1000g mask annotation.
  9. When multiple rows in dbNSFP match a variant, those rows are combined into a single row.
    The following folders under resources need to be updated if you have the downloaded version of WGSA05 on your computer: 1000Gp3, CADDv1.3, DANN, ESP6500, EnhancerFinder, ExACr0.3, GERP, PhyloP, SiPhy, clinvar, dbSNP, fathmmMKL, fitConsv1.01, funseq2, javaclass, phastCons, snoRNA_miRNA. A clean update, i.e. deleting all files under those folders (if they exist) and re-downloading the files from the updated resources, is recommended.
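
Item 5 of the v0.55 changes mentions separating multiple alt alleles and left-normalizing indels. The source does not describe how WGSA does this, so the Python sketch below uses the commonly used trim-and-extend procedure as a stand-in: shared trailing bases are trimmed, the alleles are extended to the left whenever one becomes empty, and shared leading bases are then trimmed. The function names and the toy reference sequence are hypothetical.

```python
from typing import List, Tuple


def left_normalize(seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-normalize one VCF-style variant against reference string seq.

    pos is 1-based; ref and alt must be non-empty and different.
    """
    if not ref or not alt or ref == alt:
        raise ValueError("ref and alt must be non-empty and different")
    if seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")

    while True:
        # Trim one shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
        if ref and alt:
            if ref[-1] != alt[-1]:
                break      # nothing left to trim on the right
            continue       # keep trimming
        # One allele is empty: extend both by one reference base on the left.
        if pos == 1:
            raise ValueError("cannot extend left of the sequence start")
        pos -= 1
        ref, alt = seq[pos - 1] + ref, seq[pos - 1] + alt

    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_and_normalize(seq: str, pos: int, ref: str, alts: List[str]) -> List[Tuple[int, str, str]]:
    """Separate a multi-allelic record into one normalized variant per alt."""
    return [left_normalize(seq, pos, ref, alt) for alt in alts]


# Toy reference with a CA repeat; one record carrying a deletion and an insertion.
reference = "GGGCACACACAGGG"
result = split_and_normalize(reference, 6, "CAC", ["C", "CACAC"])
print(result)  # [(3, 'GCA', 'G'), (3, 'G', 'GCA')]
```

Normalizing each allele separately matters because two databases can describe the same indel at different positions within a repeat. After normalization both descriptions collapse to the same (position, ref, alt) key and can be matched by simple lookup.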