Setting up WGSA on a local Linux machine

EQUIPMENT

Computer hardware

· This pipeline is not computationally intensive but is memory sensitive. The memory requirement depends on the number of variants to be annotated. For large whole-genome re-sequencing studies with tens of millions of variants, 32 GB of memory or more may be required. By default, the pipeline uses as much of the available CPU and RAM as it can. If the machine is not entirely devoted to this pipeline, users can set the maximum memory and number of threads available to it.

· It is possible to install WGSA on an external hard drive attached to a Linux machine. Please refer to this guidance.
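
To judge whether a machine meets the memory guideline above, its total RAM and CPU thread count can be read before installation. The following is a minimal sketch for Linux only; the function names total_mem_gb and has_min_mem_gb are hypothetical helpers, not part of WGSA, and the 32 GB threshold in the usage note is the figure quoted above for large studies.

```shell
#!/usr/bin/env bash
# Sketch: report the RAM and CPU threads available on a Linux machine.

# Print total RAM in whole GB, read from a meminfo-style file
# (defaults to /proc/meminfo, whose MemTotal line is given in kB).
total_mem_gb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Succeed if total RAM is at least the requested number of GB.
has_min_mem_gb() {
  local required_gb="$1" meminfo="${2:-/proc/meminfo}"
  [ "$(total_mem_gb "$meminfo")" -ge "$required_gb" ]
}

# Print a short summary when run on a Linux machine.
if [ -r /proc/meminfo ]; then
  echo "Total RAM: $(total_mem_gb) GB"
  echo "CPU threads: $(nproc)"
fi
```

For example, "has_min_mem_gb 32 || echo 'Less than 32 GB of RAM'" flags a machine that may be too small for a study with tens of millions of variants.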

Operating system

· We assume users have a Unix-like operating system with a bash shell on the machine for installing this pipeline. We have successfully installed it on machines running Ubuntu and SUSE Linux Enterprise. It might be possible to install the pipeline following this protocol on Mac OS X, or on Microsoft Windows with a Unix-like environment such as Cygwin (https://cygwin.com/), but additional steps may be required. This protocol shows the installation of ANNOVAR (June 08, 2020), SnpEff v4.3t and VEP v100 on SUSE Linux Enterprise Server 15 (AWS EC2 ami-030460dca5b954dda) with a bash shell as a model system. All commands below should be run in a terminal window.

Software

· If only SNV annotations are needed, Java 1.8 or higher is the only required software.

· To run ANNOVAR, SnpEff and VEP for indel annotations (or for SNV annotations on-the-fly), Perl and Java 1.8 or higher are needed, as well as the main packages and gene models for ANNOVAR, SnpEff and VEP.
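
The software requirements above can be checked from the terminal before installing anything. This is a minimal sketch under stated assumptions: java_major and installed_java_version are hypothetical helper names, and the parsing assumes that "java -version" prints the version in double quotes, as OpenJDK and Oracle builds do.

```shell
#!/usr/bin/env bash
# Sketch: check the software prerequisites (Java 1.8 or higher, and Perl).

# Reduce a Java version string to its major version:
# "1.8.0_252" -> 8 (legacy 1.x scheme), "10.0.2" -> 10.
java_major() {
  local v="$1"
  v="${v#1.}"              # drop the legacy "1." prefix if present
  echo "${v%%[._-]*}"      # keep everything before the first separator
}

# Print the quoted version string reported by "java -version" (written to stderr).
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

if command -v java >/dev/null 2>&1; then
  major="$(java_major "$(installed_java_version)")"
  if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
    echo "Java $major found: OK"
  else
    echo "Java is older than 1.8 or its version could not be parsed"
  fi
else
  echo "Java not found"
fi

if command -v perl >/dev/null 2>&1; then
  echo "Perl found: OK"
else
  echo "Perl not found (needed to run ANNOVAR, SnpEff and VEP annotations)"
fi
```

The script only reports what it finds; it does not install or change anything.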

EQUIPMENT SETUP

Folders for the pipeline

· (Optional) We recommend putting all annotation resources in a folder dedicated to the pipeline, such as /WGSA

o mkdir /WGSA

· (Optional) We recommend creating a working directory for intermediate files inside the folder dedicated to the pipeline, such as /WGSA/work

o mkdir /WGSA/work

o chmod 777 /WGSA/work

· A tmp folder with write permission, such as /WGSA/tmp, is required for annotating indels with VEP

o mkdir /WGSA/tmp

o chmod 777 /WGSA/tmp
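
The folder setup above can also be written as one repeatable function. This is a sketch rather than part of the protocol: make_wgsa_dirs is a hypothetical helper name, and the root folder is passed as an argument so that a location other than /WGSA can be used.

```shell
#!/usr/bin/env bash
# Sketch: create the pipeline folders under a configurable root.

# Create ROOT, ROOT/work and ROOT/tmp; -p makes the call safe to repeat.
make_wgsa_dirs() {
  local root="$1"
  mkdir -p "$root/work" "$root/tmp" || return 1
  # World-writable, as in the steps above; tighten if only one user runs the pipeline.
  chmod 777 "$root/work" "$root/tmp"
}

# Example (commented out because writing to / usually needs root privileges):
# make_wgsa_dirs /WGSA
```

Because mkdir -p does not fail when a folder already exists, the function can be rerun after a partial setup.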

Install ANNOVAR (required for annotating indels with ANNOVAR or annotating SNVs with ANNOVAR on-the-fly)

· Download the ANNOVAR main package from http://www.openbioinformatics.org/annovar/annovar_download.html. Please note a license is needed for commercial use of ANNOVAR.

· The package comes as annovar.latest.tar.gz. Create the folder /WGSA/annovar20200608, save the package there, and extract it:

o mkdir /WGSA/annovar20200608

o cd /WGSA/annovar20200608

o tar -zxvf annovar.latest.tar.gz

· Download RefSeq, Ensembl and UCSC Known Gene models for ANNOVAR:

o cd /WGSA/annovar20200608/annovar

o perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/

o perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/

o perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar knownGene humandb/

o perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/

o perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/

o perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar knownGene humandb/
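
The six download commands above differ only in genome build and gene model, so they can be generated with a nested loop. The sketch below only prints the commands for review; annovar_download_cmds is a hypothetical helper name, and the command line itself is copied unchanged from the steps above.

```shell
#!/usr/bin/env bash
# Sketch: generate the six ANNOVAR download commands with a nested loop.

annovar_download_cmds() {
  local build model
  for build in hg19 hg38; do
    for model in refGene ensGene knownGene; do
      echo "perl annotate_variation.pl -buildver $build -downdb -webfrom annovar $model humandb/"
    done
  done
}

# Print the commands for review (nothing is downloaded here).
annovar_download_cmds
```

To run the downloads, change to /WGSA/annovar20200608/annovar and pipe the output to bash, for example "annovar_download_cmds | bash".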

Install SnpEff (required for annotating indels with SnpEff or annotating SNVs with SnpEff on-the-fly)

· Download SnpEff v4.3t main package and save the zip file to /WGSA/snpeff:

o mkdir /WGSA/snpeff

o cd /WGSA/snpeff

o wget http://sourceforge.net/projects/snpeff/files/snpEff_v4_3t_core.zip

o unzip snpEff_v4_3t_core.zip

· Check whether Java has been installed:

o java -version

In case Java is not installed:

o sudo zypper install java-10-openjdk

· Download RefSeq and Ensembl gene models for SnpEff:

o cd /WGSA/snpeff/snpEff

o java -jar snpEff.jar download -v hg19

o java -jar snpEff.jar download -v GRCh37.75

o java -jar snpEff.jar download -v hg38

o java -jar snpEff.jar download -v GRCh38.86
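
After the downloads above finish, it can be useful to confirm that all four databases are present. The sketch below assumes SnpEff's default layout, in which each database is stored as data/<name>/snpEffectPredictor.bin inside the snpEff folder; that path is an assumption about the installation, and missing_snpeff_dbs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list SnpEff databases that are not yet present in the data folder.
# ASSUMPTION: each database is stored as <snpEff dir>/data/<name>/snpEffectPredictor.bin,
# which is SnpEff's default layout; adjust if data.dir was changed in snpEff.config.

missing_snpeff_dbs() {
  local snpeff_dir="$1" db
  for db in hg19 GRCh37.75 hg38 GRCh38.86; do
    [ -f "$snpeff_dir/data/$db/snpEffectPredictor.bin" ] || echo "$db"
  done
}
```

Calling "missing_snpeff_dbs /WGSA/snpeff/snpEff" prints one database name per line for each download that still needs to be repeated, and prints nothing when all four are in place.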

Install VEP (required for annotating indels with VEP or annotating SNVs with VEP on-the-fly)

· Download VEP 100 main package and save it to /WGSA/vep:

o mkdir /WGSA/vep

o cd /WGSA/vep

o wget https://github.com/Ensembl/ensembl-vep/archive/release/100.zip

o unzip 100.zip

· Install additional Perl modules and build tools that may be required to install the VEP API. Here are some example commands for SUSE Linux:

o sudo zypper install perl-JSON

o sudo zypper install perl-Archive-Zip

o sudo zypper install perl-DBD-mysql

o sudo zypper install make

o sudo zypper install gcc

o sudo zypper install zlib-devel

o sudo zypper install libbz2-devel

o sudo zypper install xz-devel

· Install htslib, which is required for the VEP API:

o mkdir /WGSA/htslib

o cd /WGSA/htslib

o wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2

o tar -vxjf htslib-1.9.tar.bz2

o cd htslib-1.9

o sudo make prefix=/usr/ install

· Install the VEP API to /WGSA/vep and download RefSeq and Ensembl gene models to /WGSA/.vep

o cd /WGSA/vep/ensembl-vep-release-100/

o mkdir /WGSA/.vep

o sudo perl INSTALL.pl -c /WGSA/.vep --ASSEMBLY GRCh37

o Go through the steps of the installation process, following the guidance at http://useast.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html. When asked for the cache files, choose "407 : homo_sapiens_merged_vep_100_GRCh37.tar.gz". When asked for fasta files, choose "27 : homo_sapiens". When asked for the plugins, choose "8:LOF". Downloading the fasta file is required for the current version of WGSA.

o sudo perl INSTALL.pl -c /WGSA/.vep --ASSEMBLY GRCh38

o When asked for the cache files, choose "408 : homo_sapiens_merged_vep_100_GRCh38.tar.gz". When asked for fasta files, choose "110: homo_sapiens". When asked for the plugins, choose "n", as LOF has already been installed.

o sudo chmod 777 /WGSA/.vep/Plugins

o sudo chmod 777 /WGSA/.vep/homo_sapiens/100_GRCh37

o sudo chmod 777 /WGSA/.vep/homo_sapiens/100_GRCh38

· Install the LOFTEE LOF plugin for the VEP API

o cd /WGSA/.vep/Plugins

o wget https://github.com/konradjk/loftee/archive/v0.1.1-beta.zip

o unzip -j v0.1.1-beta.zip

o rm v0.1.1-beta.zip

· Use bgzip to re-compress the GRCh37 fasta file

o cd /WGSA/.vep/homo_sapiens/100_GRCh37

o gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

o bgzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
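
To confirm that the re-compression step produced a bgzip (BGZF) file rather than an ordinary gzip file, the first bytes of the file can be inspected. The sketch below relies on the BGZF header layout (gzip magic bytes, the extra-field flag set, and a "BC" subfield tag) and assumes that "BC" is the first extra subfield, as in files written by bgzip; is_bgzf is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: test whether a .gz file is BGZF (bgzip) compressed by inspecting its header.
# A BGZF block starts with gzip magic 1f 8b, method 08, flag 04 (extra field present),
# and carries an extra subfield tagged "BC" (hex 42 43) at byte offsets 12-13.

is_bgzf() {
  local file="$1" header
  # First 16 bytes as one contiguous lowercase hex string.
  header="$(od -An -tx1 -N16 "$file" | tr -d ' \n')"
  [ "${header:0:8}" = "1f8b0804" ] && [ "${header:24:4}" = "4243" ]
}
```

For example, after running bgzip in /WGSA/.vep/homo_sapiens/100_GRCh37, the call "is_bgzf Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz && echo BGZF" should print BGZF.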

Install miniconda (required for MetaRNN and StrVCTVRE)

· Download and run the Miniconda installer:

o wget https://repo.anaconda.com/miniconda/Miniconda3-4.3.11-Linux-x86_64.sh

o bash Miniconda3-4.3.11-Linux-x86_64.sh

· Log out of the terminal and log back in, then update conda:

o conda update conda

o conda install -c anaconda _libgcc_mutex

Install MetaRNN (required for pathogenicity prediction of non-frameshift Indels with MetaRNN)

· Download and unpack MetaRNN, then create its conda environment:

o cd /WGSA

o wget -c https://usf.box.com/shared/static/orit64mmjyrrp7846xch0rr0au0vyont -O MetaRNN.tar.gz

o tar -xf MetaRNN.tar.gz

o cd MetaRNN

o conda env create -f environment.yml

· Test the installation with the following commands:

o source activate MetaRNN

o python ./MetaRNN.py hg38 test.vcf

o conda deactivate

Install StrVCTVRE (required for pathogenicity prediction of DUP or DEL SVs with StrVCTVRE)

· Download and unpack StrVCTVRE, download the phyloP scores it needs, then create its conda environment:

o cd /WGSA

o wget https://github.com/andrewSharo/StrVCTVRE/archive/v.1.7.tar.gz

o tar -xzf v.1.7.tar.gz

o cd StrVCTVRE-v.1.7/data

o wget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/hg38.phyloP100way.bw

o cd ..

o conda env create -f environment_py2.7.yml

· Test the installation with the following commands:

o source activate StrVCTVRE_py_2.7

o python test_StrVCTVRE.py

o conda deactivate
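
Before running the test commands above, it can help to confirm that both conda environments were created. The sketch below parses the output of "conda env list", in which the first column of each non-comment line is the environment name; env_listed is a hypothetical helper name, and the environment names are the ones activated in the MetaRNN and StrVCTVRE test runs.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a named conda environment exists.

# Read "conda env list"-style text on stdin and succeed if NAME is listed.
# Header lines start with "#", so their first column never matches a name.
env_listed() {
  local name="$1"
  awk -v n="$name" '$1 == n { found = 1 } END { exit !found }'
}

# Report on the two environments used by WGSA, if conda is available.
if command -v conda >/dev/null 2>&1; then
  for env in MetaRNN StrVCTVRE_py_2.7; do
    if conda env list | env_listed "$env"; then
      echo "$env: found"
    else
      echo "$env: missing"
    fi
  done
fi
```

An environment reported as missing usually means the corresponding "conda env create" step did not complete and should be repeated.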

Download the pipeline programs and other resources

· The main pipeline program is WGSA##.class where ## is the version number. We recommend putting the main program under /WGSA. The download links for the pipeline program and other resources are provided at https://sites.google.com/site/jpopgen/wgsa.

· Which resources to download depends on the annotations users want. The currently available resources can be found here. Three resource folders, javaclass, hg19 and hg38, are necessary for running the pipeline and must be downloaded. For precomputed SNV annotations from ANNOVAR, SnpEff and VEP, users should download precomputed and precomputed_hg38 for hg19 and hg38, respectively.

· To use SPIDEX free commercial version 1 with WGSA##, users need to obtain their own copy of the data set. Please follow this guidance.

· To annotate variants with frequencies of somatic coding mutations in COSMIC, users need to obtain their own copy of the data set. Please follow this guidance.

· To use CADD indel score, please follow this guidance.

· We recommend putting all downloaded resources under the folder /WGSA/resources.
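
Since javaclass, hg19 and hg38 are required for every run, a quick check of the resources folder can catch an incomplete download. This is a minimal sketch; missing_resource_dirs is a hypothetical helper name, and the resources folder is passed as an argument so that any location can be checked.

```shell
#!/usr/bin/env bash
# Sketch: list the required resource folders that are absent from a resources directory.

missing_resource_dirs() {
  local res_dir="$1" d
  for d in javaclass hg19 hg38; do
    [ -d "$res_dir/$d" ] || echo "$d"
  done
}
```

The function prints one folder name per line for each required folder that is absent, and prints nothing when all three are present. Optional folders such as precomputed and precomputed_hg38 are not checked.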

PROCEDURE

Please refer to the PROCEDURE subsection of Using WGSA via Amazon Web Service.