Using resource files directly
Resource files are typically gzipped plain text files. You can write your own program to read those files for your own annotations.
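As a starting point for such a program, the following minimal sketch reads the first few rows of a gzipped plain text file using only the Java standard library. It assumes the file is TAB-delimited, which may not hold for every resource file; the class name ResourceReader and the method readRows are illustrative and not part of WGSA.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class ResourceReader {
    /** Reads at most maxRows lines of a gzipped text file and splits each on TAB (assumed delimiter). */
    static List<String[]> readRows(Path gzFile, int maxRows) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null && rows.size() < maxRows) {
                rows.add(line.split("\t", -1)); // limit -1 keeps trailing empty fields
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java ResourceReader file_name.gz");
            return;
        }
        // Print the first five rows with a visible field separator.
        for (String[] row : readRows(Paths.get(args[0]), 5)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Reading line by line keeps memory use low, which matters because the resource files can be large.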
Using pre-computed gene-model based annotations (release 20160921)
To save disk space, the precomputed gene-model based SNV annotations were compressed twice (file name extension .lz4x2) using Adrien Grand's Java implementation of the LZ4 algorithm. Use lz4x2decompress.class to decompress .lz4x2 files, or lz4x2togz.class to convert them to gzipped files. Usage examples:
java -cp .:lz4-1.3.0.jar lz4x2decompress file_name.lz4x2
java -cp .:lz4-1.3.0.jar lz4x2togz file_name.lz4x2
lz4-1.3.0.jar can be downloaded from Maven Central
Using add_ref_allele_commandline.class (release 20170811)
Add a reference allele column to a variant list file (with chr and pos columns), using the reference fasta files from the hg19 or hg38 folder
Requirements for the input file:
A chr column
A pos column
All columns separated by TAB
Usage: java add_ref_allele_commandline [input_file] [input_file hastitle(true or false)] [chrcol] [poscol] [reference_fasta_directory]
Example: java -Xmx20g add_ref_allele_commandline my_snp.tsv true 1 2 /WGSA/resources/hg19
input_file – name of the input file. Plain text file or gzipped plain text file (with extension .gz)
input_file hastitle – whether the input file has a title row (true or false)
chrcol – column number of the chr column in the input file (e.g. if chr column is the first column, chrcol is 1)
poscol – column number of the pos column in the input file (e.g. if pos column is the second column, poscol is 2)
reference_fasta_directory – full path to the folder containing the reference fasta files (i.e. hg19 or hg38 under the resources folder)
This program requires a large amount of memory. Use -Xmx to set the maximum memory available to the program, e.g. -Xmx20g allows the program to use up to 20 GB.
Utilities for post-processing WGSA annotations
Here are some utilities for post-processing WGSA annotations. Source code is available upon request.
Using add_bedlike_annotation_Yes_No_commandline.class (release 20160921)
Add a Yes/No column to a variant list file (with chr and pos columns) against a BED-like file
Requirements for the input file:
A chr column
A pos column
All columns separated by TAB
At most one title row
Usage: java add_bedlike_annotation_Yes_No_commandline [input_file] [input_file hastitle(true or false)] [chrcol] [poscol] [bedlike_file] [first_data_row] [bed_chr_col] [anno_title]
Example: java add_bedlike_annotation_Yes_No_commandline ./snplist.txt true 1 2 ./1.narrowPeak.gz 1 1 narrowPeak1
output file name will be snplist.txt.addnarrowPeak1
input_file – name of the input file. Plain text file or gzipped plain text file (with extension .gz)
input_file hastitle – whether the input file has a title row (true or false)
chrcol – column number of the chr column in the input file (e.g. if chr column is the first column, chrcol is 1)
poscol – column number of the pos column in the input file (e.g. if pos column is the second column, poscol is 2)
bedlike_file – name of the bed-like file. Plain text file or gzipped plain text file (with extension .gz).
Three BED fields required:
Chrom - The name of the chromosome
chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
Relaxed BED format:
Allow (multiple) header rows at the top of the file
Chrom column is not necessarily the first column (but still needs to be followed by chromStart and chromEnd)
first_data_row – the row number of the first data row in the bed-like file (e.g. for a standard bed file, first_data_row is 1)
bed_chr_col – the column number of the Chrom column in the bed-like file (e.g. for a standard bed file, bed_chr_col is 1)
anno_title – title of the column to be added to the input file (e.g. narrowPeak1)
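The coordinate convention matters when matching a variant position against a BED interval. The sketch below illustrates it under two assumptions that the description above does not state explicitly: that the pos column is 1-based (as in VCF) and that the utility applies the standard BED convention (0-based start, end excluded). The class and method names are illustrative only.

```java
public class BedOverlap {
    /**
     * Returns true if a 1-based position lies inside a BED interval.
     * BED intervals are 0-based and half-open, [chromStart, chromEnd),
     * so in 1-based coordinates they cover chromStart+1 through chromEnd.
     */
    static boolean covers(long chromStart, long chromEnd, long pos1Based) {
        return pos1Based > chromStart && pos1Based <= chromEnd;
    }

    public static void main(String[] args) {
        // BED interval "chr1 100 200" covers 1-based positions 101..200.
        System.out.println(covers(100, 200, 100)); // false
        System.out.println(covers(100, 200, 101)); // true
        System.out.println(covers(100, 200, 200)); // true
        System.out.println(covers(100, 200, 201)); // false
    }
}
```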
Using add_bedlike_annotation_Yes_No_GUI.class (release 20160921)
Add Yes/No columns to multiple variant list files against multiple BED-like files using a graphical user interface
Requirements for the input file:
A chr column
A pos column
All columns separated by TAB
At most one title row
Usage (Linux): java -cp .:simpleGUI.jar add_bedlike_annotation_Yes_No_GUI
Usage (Windows): java -cp .;simpleGUI.jar add_bedlike_annotation_Yes_No_GUI
Input files and bed-like files can be plain text files or gzipped plain text files (with extension .gz)
Name of output file is input_file_name.addbed
Using add_bedlike_annotation_commandline3.class (release 20170811)
Add multiple column contents from a BED-like file to a variant list file (with chr and pos columns)
Requirements for the input file:
A chr column
A pos column
All columns separated by TAB
At most one title row
Usage: java add_bedlike_annotation_commandline3 [input_file] [input_file hastitle(true or false)] [chrcol] [poscol] [bedlike_file] [first_data_row] [bed_chr_col] [annocols (separated by ,)] [anno_titles (separated by ,)] [outfile_stem] [bedlike_file_column_delimiter (b for blank, t for TAB)]
Example: java add_bedlike_annotation_commandline3 ./snplist.txt true 1 2 ./1.narrowPeak.gz 1 1 4,7 name1,signalValue1 narrowPeak1 t
output file name will be snplist.txt.addnarrowPeak1
input_file – name of the input file. Plain text file or gzipped plain text file (with extension .gz)
input_file hastitle – whether the input file has a title row (true or false)
chrcol – column number of the chr column in the input file (e.g. if chr column is the first column, chrcol is 1)
poscol – column number of the pos column in the input file (e.g. if pos column is the second column, poscol is 2)
bedlike_file – name of the bed-like file. Plain text file or gzipped plain text file (with extension .gz).
Three BED fields required:
Chrom - The name of the chromosome
chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
Relaxed BED format:
Allow (multiple) header rows at the top of the file
Chrom column is not necessarily the first column (but still needs to be followed by chromStart and chromEnd)
first_data_row – the row number of the first data row in the bed-like file (e.g. for a standard bed file, first_data_row is 1)
bed_chr_col – the column number of the Chrom column in the bed-like file (e.g. for a standard bed file, bed_chr_col is 1)
annocols – column numbers (separated by “,”) in the bed-like file whose contents are used for annotation (e.g. to use the name and signalValue columns of a narrowPeak file, annocols is 4,7)
anno_titles – titles (separated by “,”) of the columns to be added to the input file, in the same order as annocols (e.g. for the name and signalValue columns of a narrowPeak file, anno_titles can be name1,signalValue1)
outfile_stem – stem of the output file extension; the output file is named after the input file with .add and outfile_stem appended (e.g. snplist.txt.addnarrowPeak1 when outfile_stem is narrowPeak1)
bedlike_file_column_delimiter – either b or t, specifying whether the bed-like file is delimited by blanks (runs of consecutive spaces and/or TABs) or by TABs
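To make the annocols and delimiter parameters concrete, here is a minimal sketch of how one line of a bed-like file could be split and the requested columns extracted. It is not the utility's actual implementation: the class BedLineParser is hypothetical, and returning "." for a column that is absent from a line is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class BedLineParser {
    /** Splits a line: flag "t" splits on each TAB; flag "b" splits on runs of spaces and/or TABs. */
    static String[] splitLine(String line, String delimiterFlag) {
        if (delimiterFlag.equals("t")) {
            return line.split("\t", -1);
        }
        return line.trim().split("\\s+");
    }

    /** Picks the 1-based columns listed in annocols (e.g. "4,7") from the split fields. */
    static List<String> pickColumns(String[] fields, String annocols) {
        List<String> picked = new ArrayList<>();
        for (String col : annocols.split(",")) {
            int idx = Integer.parseInt(col.trim()) - 1; // convert to 0-based index
            picked.add(idx >= 0 && idx < fields.length ? fields[idx] : "."); // "." is an assumed placeholder
        }
        return picked;
    }

    public static void main(String[] args) {
        // A made-up narrowPeak-style line; column 4 is name, column 7 is signalValue.
        String line = "chr1\t100\t200\tpeak1\t500\t.\t12.5\t-1\t-1\t50";
        System.out.println(pickColumns(splitLine(line, "t"), "4,7")); // [peak1, 12.5]
    }
}
```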
Using WGSAprogram31_CANONICAL_only4_commandline.class (release 20170811)
Simplify the gene-model based annotation by retaining only transcript-specific annotations for VEP defined canonical transcripts.
Requirements for the input file:
the input file must be the output annotation of WGSA and at least contain the VEP ensembl or VEP refseq annotation columns
Usage: java WGSAprogram31_CANONICAL_only4_commandline WGSA_output_file
Example: java WGSAprogram31_CANONICAL_only4_commandline All.chr21.hg38.phase3.+AC+AN.1alt.left-normalized.vcf.gz.annotated.indel.gz
WGSA_output_file – the output file of WGSA 06 and up
Please note that if the variant does not affect any canonical transcript (e.g. an intergenic variant, or a variant affecting only non-canonical transcripts), all gene-based annotation columns will be set to missing, i.e. ".".
Using add_VEP_most_damaging_ensembl_commandline3.class (release 20190411)
Simplify the consequence interpretation by identifying the most damaging consequence based on VEP's ensembl annotation for each gene.
Please note that in the output file each variant may have multiple rows, one for each gene it affects. To get back to one row per variant, retain only the rows with unique_variant="Y" (see below).
Requirements for the input file:
the input file must be the output annotation of WGSA and at least contain the VEP ensembl annotation columns
Usage: java add_VEP_most_damaging_ensembl_commandline3 [WGSA_output_file]
Example: java add_VEP_most_damaging_ensembl_commandline3 All.chr21.hg38.phase3.+AC+AN.1alt.left-normalized.vcf.gz.annotated.indel.gz
WGSA_output_file – the output file of WGSA 06 and up
Columns added to the output file:
VEP_ensembl_transcript_precedent_consequence: the most severe consequence of all transcripts affected based on VEP (83) annotation. Multiple consequences are separated by ",". The order of severity is according to this rank
VEP_ensembl_precedent_consequence: the most severe consequence in VEP_ensembl_transcript_precedent_consequence. The order of severity is according to this rank
VEP_ensembl_precedent_gene: gene name associated with VEP_ensembl_transcript_precedent_consequence
unique_variant: "Y" for the most "damaging" consequence/gene of the variant; "N" for other consequences/genes
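Reducing the output to one row per variant amounts to a simple filter on the unique_variant column. The following sketch shows one way to do it, assuming the output is TAB-delimited with a single title row that contains the column name unique_variant; the class name and the sample rows are made up for illustration. The same filter should apply to the output of the other utilities in this document that add a unique_variant column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UniqueVariantFilter {
    /** Keeps the title row plus every data row whose unique_variant field equals "Y". */
    static List<String> keepUnique(List<String> lines) {
        List<String> kept = new ArrayList<>();
        if (lines.isEmpty()) {
            return kept;
        }
        String header = lines.get(0);
        int col = Arrays.asList(header.split("\t", -1)).indexOf("unique_variant");
        if (col < 0) {
            throw new IllegalArgumentException("no unique_variant column in title row");
        }
        kept.add(header);
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split("\t", -1);
            if (col < fields.length && fields[col].equals("Y")) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical rows: one variant affecting two genes, another affecting one.
        List<String> lines = Arrays.asList(
                "chr\tpos\tVEP_ensembl_precedent_gene\tunique_variant",
                "21\t1000\tGENE_A\tY",
                "21\t1000\tGENE_B\tN",
                "21\t2000\tGENE_C\tY");
        keepUnique(lines).forEach(System.out::println);
    }
}
```

For real output files, the lines would be streamed from the gzipped file rather than held in a list.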
Using add_annovar_most_damaging_refseq_commandline.class (release 20170811)
Simplify the consequence interpretation by identifying the most damaging consequence based on ANNOVAR's refseq annotation for each gene.
Please note that in the output file each variant may have multiple rows, one for each gene it affects. To get back to one row per variant, retain only the rows with unique_variant="Y" (see below).
Requirements for the input file:
the input file must be the output annotation of WGSA and at least contain the ANNOVAR refseq annotation columns
Usage: java add_annovar_most_damaging_refseq_commandline [WGSA_output_file]
Example: java add_annovar_most_damaging_refseq_commandline All.chr21.hg38.phase3.+AC+AN.1alt.left-normalized.vcf.gz.annotated.indel.gz
WGSA_output_file – the output file of WGSA 06 and up
Columns added to the output file:
ANNOVAR_refseq_precedent_consequence: the most severe consequence of all transcripts affected based on ANNOVAR annotation. Multiple consequences are separated by ",". The order of severity is according to this rank
ANNOVAR_refseq_precedent_gene: gene name associated with ANNOVAR_refseq_precedent_consequence
unique_variant: "Y" for the most "damaging" consequence/gene of the variant; "N" for other consequences/genes
Using add_annovar_most_damaging_ucsc_commandline2.class (release 20170811)
Simplify the consequence interpretation by identifying the most damaging consequence based on ANNOVAR's ucsc annotation for each gene.
Please note that in the output file each variant may have multiple rows, one for each gene it affects. To get back to one row per variant, retain only the rows with unique_variant="Y" (see below).
Requirements for the input file:
the input file must be the output annotation of WGSA and at least contain the ANNOVAR ucsc annotation columns
Usage: java add_annovar_most_damaging_ucsc_commandline2 [WGSA_output_file]
Example: java add_annovar_most_damaging_ucsc_commandline2 All.chr21.hg38.phase3.+AC+AN.1alt.left-normalized.vcf.gz.annotated.indel.gz
WGSA_output_file – the output file of WGSA 06 and up
Columns added to the output file:
ANNOVAR_ucsc_precedent_consequence: the most severe consequence of all transcripts affected based on ANNOVAR annotation. Multiple consequences are separated by ",". The order of severity is according to this rank
ANNOVAR_ucsc_precedent_gene: gene name associated with ANNOVAR_ucsc_precedent_consequence
unique_variant: "Y" for the most "damaging" consequence/gene of the variant; "N" for other consequences/genes
Using add_dbNSFP_gene_commandline.class (release 20190223)
Add gene annotations from the dbNSFP_gene table
Requirements for the input file:
A column with gene name, i.e. HGNC gene symbol
Usage: java add_dbNSFP_gene_commandline [input_file] [input_file hastitle(true or false)] [gene name column number] [dbNSFP_gene file]
Example: java add_dbNSFP_gene_commandline All.chr21.hg38.phase3.+AC+AN.1alt.left-normalized.vcf.gz.annotated.indel.VEP_precedence_ensembl.gz true 126 /WGSA/resources/dbNSFP/dbNSFP4.0b2_gene.complete.gz
input_file – name of the input file. Plain text file or gzipped plain text file (with extension .gz)
input_file hastitle – whether the input file has a title row (true or false)
gene name column number – column number of the gene name column in the input file (e.g. if gene name column is the first column, gene name column number is 1)
dbNSFP_gene file – the (path to the) dbNSFP_gene file
Columns added to the output file:
All columns from the dbNSFP_gene file, from Gene_old_names to the end.