Using WGSA via Amazon Web Services (AWS)

EQUIPMENT

Computing environment

· A computer with an internet connection.

· A secure shell (SSH) client installed in the operating system (e.g. PuTTY).

· An SCP or SFTP client installed in the operating system (e.g. FileZilla).

EQUIPMENT SETUP

Create an AWS account

· If you already have an AWS account, skip this step.

· Follow the steps at http://aws.amazon.com/ to create an account.

· A graphical guide to this step can be found at https://sites.google.com/site/jpopgen/wgsa/create-an-aws-account.

Launch an instance from an AMI of WGSA

· Sign in to your AWS account. Navigate to the EC2 Dashboard.

· Search for a WGSA public AMI and choose to launch an instance from it. A list of available WGSA AMIs can be found at https://sites.google.com/site/jpopgen/wgsa.

· Configure the type and details of the instance and launch the instance.

· A graphical guide to this step can be found at https://sites.google.com/site/jpopgen/wgsa/launch-an-instance.

Terminate an instance (after the PROCEDURE is finished)

· Navigate to the EC2 Dashboard.

· Select the instance and choose “Terminate” from “Instance State”.

· Find the EBS volume that is left over from the instance and choose “Delete Volume”.

· A graphical guide to this step can be found at https://sites.google.com/site/jpopgen/wgsa/terminate-an-instance.

PROCEDURE

1| Prepare input files ▲CRITICAL

· Two input files are needed. One is a variant file and the other is a configuration/setting file.

· The standard variant file is a plain-text file with TAB-delimited columns (tsv format). The first row must be a title row, followed by variant rows with one variant per row. The first four columns must be chromosome, position, reference allele and alternative allele, in the formats defined by the vcf file format. Multiple alternative alleles for the same reference allele must be split into separate rows. Additional columns can be included (since WGSA0.55 those columns are retained in the final output file). An example is shown below. An example file can be downloaded here.

· Alternatively, a vcf format file can be used as the variant file. The pipeline automatically recognizes the format if the file has the extension .vcf or .vcf.gz (since WGSA06). The user can also specify the format with the "-i vcf" option. The vcf file is converted to a standard variant file (only the four columns of chromosome, position, reference allele and alternative allele are kept).

· A setting/configuration file is a plain-text file in which the user provides the name of the input file, the name of the output file, the directories of various resources and the options for annotation. The first 35 lines of a configuration file are shown below. Example template files can be found here. The file is self-explanatory: users change the contents of the second column to specify their options, and the available options are given in the third column (beginning with #). (! CAUTION To run the pipeline on a local machine, the directory settings (lines 3 to 9) must be modified to reflect the absolute paths to the corresponding directories on the local machine.)

· Upload the variant file and the configuration file to the folder where WGSA##.class resides, e.g. /WGSA.
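The layout rules for the standard variant file can be checked before upload. The following is a minimal sketch, assuming a POSIX shell with awk; the function names split_multiallelic and check_variant_tsv are hypothetical helpers and are not part of WGSA. The first function splits comma-separated alternative alleles into one row each, and the second reports rows that break the four-column layout described above.

```shell
# Hypothetical helpers for illustration; they are not part of WGSA.

# Split rows whose 4th column holds comma-separated alternative alleles
# into one row per allele. The title row (first line) is passed through.
split_multiallelic() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == 1 { print; next }
       {
         n = split($4, alts, ",")
         for (i = 1; i <= n; i++) { $4 = alts[i]; print }
       }' "$1"
}

# Report rows with fewer than four TAB-delimited columns, a non-integer
# position, or several alternative alleles in one row.
# Returns 0 when no problem is found and 1 otherwise.
check_variant_tsv() {
  awk 'BEGIN { FS = "\t"; bad = 0 }
       NR == 1 { next }   # the title row is not checked
       NF < 4 { printf "line %d: fewer than 4 columns\n", NR; bad = 1; next }
       $2 !~ /^[0-9]+$/ { printf "line %d: position is not an integer\n", NR; bad = 1 }
       $4 ~ /,/ { printf "line %d: multiple alternative alleles\n", NR; bad = 1 }
       END { exit bad }' "$1"
}
```

A possible use is to run split_multiallelic on the original file, redirect the result to a new file, and then run check_variant_tsv on that new file before uploading it.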

2| Upload input files

· SSH to the machine with WGSA installed (e.g. a WGSA AMI instance).

· Upload the input files with SCP or SFTP.

· A graphical guide to accessing a WGSA instance, using PuTTY and FileZilla as examples, can be found at https://sites.google.com/site/jpopgen/wgsa/ssh-and-sftp-to-an-instance. Guidance for a Linux environment, or across platforms using MindTerm, can be found at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html.
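For users working from a command line rather than PuTTY and FileZilla, the connection and upload in this step might look like the sketch below. The key file name, the login name ec2-user, the host name and the input file names are all placeholders (assumptions); the correct login name depends on the AMI. The function only prints the ssh and scp commands so that they can be reviewed before being run.

```shell
# All values below are placeholders (assumptions); replace them with your own.
KEY="my-key.pem"                               # private key of the EC2 key pair
REMOTE="ec2-user@your-instance-public-dns"     # login name depends on the AMI
REMOTE_DIR="/WGSA"                             # folder where WGSA##.class resides

# Print the commands for logging in and for uploading the two input files.
# $1 = variant file, $2 = configuration file
print_transfer_commands() {
  printf 'ssh -i %s %s\n' "$KEY" "$REMOTE"
  printf 'scp -i %s %s %s %s:%s/\n' "$KEY" "$1" "$2" "$REMOTE" "$REMOTE_DIR"
}

print_transfer_commands my_variants.vcf.gz my_run.setting
```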

3| Create the pipeline shell script

· Within the terminal, change the directory to the WGSA folder where the WGSA##.class resides (i.e. /WGSA for a WGSA AMI instance):

o cd /WGSA

· Run the WGSA main program (e.g. WGSA095.class) followed by the configuration file name (for example, test1000g-hg38-WGSA095.EC2.setting). The program first prepares some intermediate files in the working directory and then creates a shell script named test1000g-hg38-WGSA095.EC2.setting.sh, a text file describing the columns of the final SNV annotation file (if applicable) and a text file describing the columns of the final indel annotation file (if applicable). The usage of WGSA095 is:

o java WGSA095 [setting_file] <-m maximum_number_of_GB_memory_to_use> <-t maximum_number_of_threads_to_use> <-v hg19_or_hg38> <-i vcf_or_tsv>

· By default, the input file is assumed to be in tsv format and the variant coordinates in hg38. Use -i vcf and -v hg19 to specify that the input file is in vcf format and that the coordinates are in hg19. By default, WGSA tries to use the maximum available RAM and CPU threads when applicable. To limit them, use the -m and -t options to set the maximum memory (in GiB) and the maximum number of threads. For example, the following command limits the pipeline to 30 GiB of memory and 4 threads:

o java -Xmx30g WGSA095 test1000g-hg38-WGSA095.EC2.setting -i vcf -v hg38 -m 30 -t 4

4| Run the pipeline shell script

· Run the shell script created in 3|, for example,

o bash test1000g-hg38-WGSA095.EC2.setting.sh

· (Optional) We recommend running the script in the background and saving the standard output and error messages to files for record keeping.

o nohup bash test1000g-hg38-WGSA095.EC2.setting.sh >output.txt 2>error.txt &

· Completing all annotation steps may take hours or even days, depending mostly on the total number of variants to be annotated and the annotation steps specified in the configuration file. The time spent annotating the 1000 Genomes chromosome 21 subset (1,063,067 SNVs and 47,492 indels) in an experimental run with WGSA095 is shown here.
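Because a run can last hours or days, it can help to record the process ID of the background job so that its state can be checked after logging back in. The following is a minimal sketch, assuming a POSIX shell; the function names and the file name wgsa.pid are hypothetical, and the output file names follow the nohup example in this step.

```shell
# Hypothetical helpers for illustration; they are not part of WGSA.

# Start the pipeline shell script in the background, save its standard
# output and error messages, and record its process ID in wgsa.pid.
# $1 = shell script created in step 3
start_background() {
  nohup bash "$1" >output.txt 2>error.txt &
  echo $! > wgsa.pid
}

# Report whether the recorded process is still alive. "finished" only
# means the process has ended; check error.txt to see whether it succeeded.
pipeline_status() {
  if kill -0 "$(cat wgsa.pid)" 2>/dev/null; then
    echo "running"
  else
    echo "finished"
  fi
}
```

For example, call start_background test1000g-hg38-WGSA095.EC2.setting.sh once, then call pipeline_status and inspect the last lines of the log files with tail -n 20 output.txt error.txt whenever needed.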

5| Download output files

· SNV and indel annotation files are written separately, together with descriptions of their columns (see the example column description files for the above test1000g-hg38-WGSA095.EC2.setting: Column description for SNV annotation files and Column description for indel annotation files).

· If you chose to retain intermediate files, the intermediate files from each step are kept. They can be useful if the annotation pipeline is interrupted (e.g. when using spot instances). Make sure you have enough disk space for the intermediate files, which can be huge.

· Download the output files using SFTP or SCP (see step 2|).
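Since retained intermediate files can be huge, free disk space in the working directory can be checked before and during a run. The following is a minimal sketch, assuming a POSIX shell with df and awk; the function names are hypothetical, and the threshold is a value the user must choose, as the source does not state how much space a run needs.

```shell
# Hypothetical helpers for illustration; they are not part of WGSA.

# Print the free space, in GiB (rounded down), on the file system that
# holds directory $1. "df -P" gives one line per file system, with the
# available 1024-byte blocks in the 4th column.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Warn when free space is below a user-chosen threshold.
# $1 = directory, $2 = minimum free space in GiB (pick your own value)
check_space() {
  free=$(free_gib "$1")
  if [ "$free" -lt "$2" ]; then
    echo "WARNING: only ${free} GiB free in $1"
    return 1
  fi
  echo "OK: ${free} GiB free in $1"
}
```

For example, check_space /WGSA 100 would warn if less than 100 GiB is free on the file system holding /WGSA; the figure 100 is only an illustration.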