Building High-Confidence Mutation Call Sets v1.2

Release v1.2

Latest reference call set along with the high-confidence regions be obtained from the NCBI server.

All-Inclusive Call Sets

These two files includes high-confidence calls (labeled PASS in the FILTER column), as well as low-confidence and likely false positive calls. Note that the PASS label is different from the AllPASS label. AllPASS is an intermediate label that denotes variants that SomaticSeq has classified as a somatic mutation in each of the 63 tumor-normal BAM files. PASS label denotes high-confidence somatic calls that are deemed "true somatic mutations."

High-Confidence Somatic Mutations

These files include only the high-confidence somatic mutation calls in the high-confidence regions. They can be used as truth sets along with the high-confidence regions file.

High-Confidence Regions

This file contains the regions of the genome where somatic mutations could be confidently determined. It excludes uncallable and unresolved genome regions. Variant calls outside these regions could not be confidently resolved.

2021 MAQC Society 4th Annual Meeting

A 25-minute talk to describe how the reference call set was established on a high level

Improvements in v1.2 over v1.1

We used PacBio data to determine if our high-confidence somatic mutation calls are inconsistent with the orthogonal long-read technology. For high-VAF calls in low-mapping score (MQ<50) and low sequence complexity regions, we demoted their confidence level when they were unsupported by PacBio data (p < 0.05 for one-tailed two-proportion z-test when PacBio has no supporting read), resulting in the removal of 33 SNVs and 11 indels from the reference call set due to discordance from PacBio data.

Due to the relatively high error rate and low sequencing depth of our PacBio data, there are calls with zero variant read in PacBio. However, if p > 0.05 for these variant calls, they were not removed from the high-confidence call set, because we do not deem them to be inconsistent with PacBio data. 

Steps to create the high-confidence call set v1.2

Step 1) Create SomaticSeq classifiers for each sequencing site (i.e., data group) that produced technical replicates

The detailed documentation of all those runs are described in detail in the page: BAM Simulation pipeline in SomaticSeq, which not only describes our commands, but has link to our archived run scripts.

Files related to this step uploaded to CGC are in FDA_SEQC2_WG#1 projects with tags of "bamsurgeon."

Examples

Step 2) Run six somatic mutation callers for each of the 63 tumor-normal pairs on CGC

Example CGC Workflow to run MuTect2, SomaticSniper, VarDict, MuSE, and Strelka on a pair of tumor-normal data set:

Example of CGC Workflow to run TNscope 201711.02:

The details of somatic mutation workflows on CGC, and the location of the result output files are described in this page: Somatic Mutation Results on Seven Bridges Genomics

Step 3) Use SomaticSeq to classify mutation calls for each of the 63 call sets

This step was also done locally at Roche Sequencing. However, the results as well as working examples are on the CGC.

For each of the 54 data sets that came from a sequencing center with replicates, i.e., IL/NS/FD/NV with BWA/Bowtie/NovoAlign, the corresponding SomaticSeq classifier was used to score each variant. For EA/NC/LL data, the classifiers trained by combining IL/NS/FD/NV data sets were used. 

Documentations and files for archival purposes:

Step 4) Modify the VCF file format for each of the 63 call set, so they can be merged later

We moved useful items in INFO to sample columns, in order to preserve such information after combining the VCF files using GATK3 CombineVariants. The script is in the seqc2 branch of SomaticSeq.

seqc2/somaticseq/utilities/reformat_VCF2SEQC2.py \

-infile  WGS.bwa.dedup-FD_T_1_vs_FD_N_1-SomaticSeq.sSNV.vcf \

-outfile reFormat.sSNV.predicted.by.bwa.IL1N_vs_IL2N.twoWay.vcf \

-callers MSDUKT \

-tumor   FD_T_1.bwa \

-trained

The commands/scripts to invoke SomaticSeq classification (step 3) as well as SEQC2 reformatting: here.

Step 5) Use GATK3 CombineVariants to merge 63 SNV VCF files and 63 INDEL VCF files


java -Xmx32G -jar GenomeAnalysisTK.jar -T CombineVariants -R GRCh38.d1.vd1.fa --setKey null --genotypemergeoption UNSORTED \-V EATRIS/bowtie.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V EATRIS/bwa.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V EATRIS/novo.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/bowtie.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/bwa.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/novo.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/bowtie.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/bwa.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/novo.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-o sSNV.v2.8.1.combined.vcf

Step 6) Create an intermediate VCF file, preserving variant calls that has been labeled PASS in at least one out of the 63 sample sets

Command:

# SNV

seqc2/somaticseq/utilities/highConfidenceBuilder.py -infile sSNV.v2.8.1.combined.vcf.gz -outfile sSNV.MSDUKT.dedup_all.vcf -ncallers 3 -all


# INDEL

seqc2/somaticseq/utilities/highConfidenceBuilder.py -infile sINDEL.v2.8.1.combined.vcf.gz -outfile sINDEL.MDKT.dedup_all.vcf -ncallers 2 -all

Step 7) For each variant position from the previous step, extract sequencing features from each of the 63 BAM files


# SNV (Equivalent for INDEL)seqc2/somaticseq/SSeq_vcf2tsv_multiPairBam.py \-myvcf sSNV.MSDUKT.dedup_all.vcf.gz \-nprefix IL_N_1.bwa IL_N_1.bowtie IL_N_1.novo IL_N_2.bwa IL_N_2.bowtie IL_N_2.novo IL_N_3.bwa IL_N_3.bowtie IL_N_3.novo NV_N_1.bwa NV_N_1.bowtie NV_N_1.novo NV_N_2.bwa NV_N_2.bowtie NV_N_2.novo NV_N_3.bwa NV_N_3.bowtie NV_N_3.novo FD_N_1.bwa FD_N_1.bowtie FD_N_1.novo FD_N_2.bwa FD_N_2.bowtie FD_N_2.novo FD_N_3.bwa FD_N_3.bowtie FD_N_3.novo NS_N_1.bwa NS_N_1.bowtie NS_N_1.novo NS_N_2.bwa NS_N_2.bowtie NS_N_2.novo NS_N_3.bwa NS_N_3.bowtie NS_N_3.novo NS_N_4.bwa NS_N_4.bowtie NS_N_4.novo NS_N_5.bwa NS_N_5.bowtie NS_N_5.novo NS_N_6.bwa NS_N_6.bowtie NS_N_6.novo NS_N_7.bwa NS_N_7.bowtie NS_N_7.novo NS_N_8.bwa NS_N_8.bowtie NS_N_8.novo NS_N_9.bwa NS_N_9.bowtie NS_N_9.novo EA_N_1.bwa EA_N_1.bowtie EA_N_1.novo NC_N_1.bwa NC_N_1.bowtie NC_N_1.novo LL_N_1.bwa LL_N_1.bowtie LL_N_1.novo \-tprefix IL_T_1.bwa IL_T_1.bowtie IL_T_1.novo IL_T_2.bwa IL_T_2.bowtie IL_T_2.novo IL_T_3.bwa IL_T_3.bowtie IL_T_3.novo NV_T_1.bwa NV_T_1.bowtie NV_T_1.novo NV_T_2.bwa NV_T_2.bowtie NV_T_2.novo NV_T_3.bwa NV_T_3.bowtie NV_T_3.novo FD_T_1.bwa FD_T_1.bowtie FD_T_1.novo FD_T_2.bwa FD_T_2.bowtie FD_T_2.novo FD_T_3.bwa FD_T_3.bowtie FD_T_3.novo NS_T_1.bwa NS_T_1.bowtie NS_T_1.novo NS_T_2.bwa NS_T_2.bowtie NS_T_2.novo NS_T_3.bwa NS_T_3.bowtie NS_T_3.novo NS_T_4.bwa NS_T_4.bowtie NS_T_4.novo NS_T_5.bwa NS_T_5.bowtie NS_T_5.novo NS_T_6.bwa NS_T_6.bowtie NS_T_6.novo NS_T_7.bwa NS_T_7.bowtie NS_T_7.novo NS_T_8.bwa NS_T_8.bowtie NS_T_8.novo NS_T_9.bwa NS_T_9.bowtie NS_T_9.novo EA_T_1.bwa EA_T_1.bowtie EA_T_1.novo NC_T_1.bwa NC_T_1.bowtie NC_T_1.novo LL_T_1.bwa LL_T_1.bowtie LL_T_1.novo \-nbam WGS_IL_N_1.bwa.dedup.bam WGS_IL_N_1.bowtie.dedup.bam WGS_IL_N_1.novo.dedup.bam WGS_IL_N_2.bwa.dedup.bam WGS_IL_N_2.bowtie.dedup.bam WGS_IL_N_2.novo.dedup.bam WGS_IL_N_3.bwa.dedup.bam WGS_IL_N_3.bowtie.dedup.bam WGS_IL_N_3.novo.dedup.bam WGS_NV_N_1.bwa.dedup.bam WGS_NV_N_1.bowtie.dedup.bam WGS_NV_N_1.novo.dedup.bam WGS_NV_N_2.bwa.dedup.bam WGS_NV_N_2.bowtie.dedup.bam WGS_NV_N_2.novo.dedup.bam WGS_NV_N_3.bwa.dedup.bam WGS_NV_N_3.bowtie.dedup.bam WGS_NV_N_3.novo.dedup.bam WGS_FD_N_1.bwa.dedup.bam WGS_FD_N_1.bowtie.dedup.bam WGS_FD_N_1.novo.dedup.bam WGS_FD_N_2.bwa.dedup.bam WGS_FD_N_2.bowtie.dedup.bam WGS_FD_N_2.novo.dedup.bam WGS_FD_N_3.bwa.dedup.bam WGS_FD_N_3.bowtie.dedup.bam WGS_FD_N_3.novo.dedup.bam WGS_NS_N_1.bwa.dedup.bam WGS_NS_N_1.bowtie.dedup.bam WGS_NS_N_1.novo.dedup.bam WGS_NS_N_2.bwa.dedup.bam WGS_NS_N_2.bowtie.dedup.bam WGS_NS_N_2.novo.dedup.bam WGS_NS_N_3.bwa.dedup.bam WGS_NS_N_3.bowtie.dedup.bam WGS_NS_N_3.novo.dedup.bam WGS_NS_N_4.bwa.dedup.bam WGS_NS_N_4.bowtie.dedup.bam WGS_NS_N_4.novo.dedup.bam WGS_NS_N_5.bwa.dedup.bam WGS_NS_N_5.bowtie.dedup.bam WGS_NS_N_5.novo.dedup.bam WGS_NS_N_6.bwa.dedup.bam WGS_NS_N_6.bowtie.dedup.bam WGS_NS_N_6.novo.dedup.bam WGS_NS_N_7.bwa.dedup.bam WGS_NS_N_7.bowtie.dedup.bam WGS_NS_N_7.novo.dedup.bam WGS_NS_N_8.bwa.dedup.bam WGS_NS_N_8.bowtie.dedup.bam WGS_NS_N_8.novo.dedup.bam WGS_NS_N_9.bwa.dedup.bam WGS_NS_N_9.bowtie.dedup.bam WGS_NS_N_9.novo.dedup.bam WGS_EA_N_1.bwa.dedup.bam WGS_EA_N_1.bowtie.dedup.bam WGS_EA_N_1.novo.dedup.bam WGS_NC_N_1.bwa.dedup.bam WGS_NC_N_1.bowtie.dedup.bam WGS_NC_N_1.novo.dedup.bam WGS_LL_N_1.bwa.dedup.bam WGS_LL_N_1.bowtie.dedup.bam WGS_LL_N_1.novo.dedup.bam \-tbam WGS_IL_T_1.bwa.dedup.bam WGS_IL_T_1.bowtie.dedup.bam WGS_IL_T_1.novo.dedup.bam WGS_IL_T_2.bwa.dedup.bam WGS_IL_T_2.bowtie.dedup.bam WGS_IL_T_2.novo.dedup.bam WGS_IL_T_3.bwa.dedup.bam WGS_IL_T_3.bowtie.dedup.bam WGS_IL_T_3.novo.dedup.bam WGS_NV_T_1.bwa.dedup.bam WGS_NV_T_1.bowtie.dedup.bam WGS_NV_T_1.novo.dedup.bam WGS_NV_T_2.bwa.dedup.bam WGS_NV_T_2.bowtie.dedup.bam WGS_NV_T_2.novo.dedup.bam WGS_NV_T_3.bwa.dedup.bam WGS_NV_T_3.bowtie.dedup.bam WGS_NV_T_3.novo.dedup.bam WGS_FD_T_1.bwa.dedup.bam WGS_FD_T_1.bowtie.dedup.bam WGS_FD_T_1.novo.dedup.bam WGS_FD_T_2.bwa.dedup.bam WGS_FD_T_2.bowtie.dedup.bam WGS_FD_T_2.novo.dedup.bam WGS_FD_T_3.bwa.dedup.bam WGS_FD_T_3.bowtie.dedup.bam WGS_FD_T_3.novo.dedup.bam WGS_NS_T_1.bwa.dedup.bam WGS_NS_T_1.bowtie.dedup.bam WGS_NS_T_1.novo.dedup.bam WGS_NS_T_2.bwa.dedup.bam WGS_NS_T_2.bowtie.dedup.bam WGS_NS_T_2.novo.dedup.bam WGS_NS_T_3.bwa.dedup.bam WGS_NS_T_3.bowtie.dedup.bam WGS_NS_T_3.novo.dedup.bam WGS_NS_T_4.bwa.dedup.bam WGS_NS_T_4.bowtie.dedup.bam WGS_NS_T_4.novo.dedup.bam WGS_NS_T_5.bwa.dedup.bam WGS_NS_T_5.bowtie.dedup.bam WGS_NS_T_5.novo.dedup.bam WGS_NS_T_6.bwa.dedup.bam WGS_NS_T_6.bowtie.dedup.bam WGS_NS_T_6.novo.dedup.bam WGS_NS_T_7.bwa.dedup.bam WGS_NS_T_7.bowtie.dedup.bam WGS_NS_T_7.novo.dedup.bam WGS_NS_T_8.bwa.dedup.bam WGS_NS_T_8.bowtie.dedup.bam WGS_NS_T_8.novo.dedup.bam WGS_NS_T_9.bwa.dedup.bam WGS_NS_T_9.bowtie.dedup.bam WGS_NS_T_9.novo.dedup.bam WGS_EA_T_1.bwa.dedup.bam WGS_EA_T_1.bowtie.dedup.bam WGS_EA_T_1.novo.dedup.bam WGS_NC_T_1.bwa.dedup.bam WGS_NC_T_1.bowtie.dedup.bam WGS_NC_T_1.novo.dedup.bam WGS_LL_T_1.bwa.dedup.bam WGS_LL_T_1.bowtie.dedup.bam WGS_LL_T_1.novo.dedup.bam \-ref GRCh38.d1.vd1.fa -dbsnp dbsnp_146.hg38.vcf.gz -cosmic COSMICv83.All.noSNP.vcf -inclusion _T_ -callers MSDUKT -dedup -outfile 63BAMs.sSNV.MSDUKT.dedup_all.tsv

Step 8) Check with tumor purity assessment data sets, and add a FLAG=inconsistentVAF if VAF does not move as expected

1) Create VAF files for each tumor purity data out of the 300X SPP data sets

2) Make annotations/flags based on the VAF files created above, using the following command:

seqc2/somaticseq/utilities/titrationConsistencyTest.py \

-infile sSNV.MSDUKT.dedup_all.vcf.gz \

-vafs SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-0_mergeThree.bwa.dedup.txt \

      SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_3-1_mergeThree.bwa.dedup.txt \

      SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-1_mergeThree.bwa.dedup.txt \

      SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-4_mergeThree.bwa.dedup.txt \

      SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_0-1_mergeThree.bwa.dedup.txt \

-outfile sSNV.MSDUKT.dedup_all.SPP.vcf


seqc2/somaticseq/utilities/titrationConsistencyTest.py \

-infile sINDEL.MDKT.dedup_all.vcf.gz \

-vafs SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-0_mergeThree.bwa.dedup.txt  \

      SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_3-1_mergeThree.bwa.dedup.txt \

      SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-1_mergeThree.bwa.dedup.txt \

      SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-4_mergeThree.bwa.dedup.txt \

      SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_0-1_mergeThree.bwa.dedup.txt \

-outfile sINDEL.MDKT.dedup_all.SPP.vcf

Step 9) Annotate initial confidence-levels of mutation candidates based on initial tiering and the sequencing features extracted in Step 7. 

Commands

# Merge the regions in MajorityAlignerCallable.bed, to create fewer regions to speed up the next two scripts:

mergeBed -i MajorityAlignersCallable.bed | intersedBed -a stdin -b genome.bed > BED/ConsensusCallableRegions.bed


seqc2/somaticseq/utilities/highConfidenceBuilder_2ndPass.py -vcfin sSNV.MSDUKT.dedup_all.SPP.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.dedup_all.tsv.gz -callable BED/ConsensusCallableRegions.bed -exclude BED/germline_chromosome_arm_loss.bed -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.vcf -type snv


seqc2/somaticseq/utilities/highConfidenceBuilder_2ndPass.py -vcfin sINDEL.MDKT.dedup_all.SPP.vcf.gz -tsvin 63BAMs.sINDEL.MDKT.dedup_all.tsv.gz -callable BED/ConsensusCallableRegions.bed -exclude BED/germline_chromosome_arm_loss.bed -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.vcf -type indel

Step 10) Use two sets of high-depth data to "rescue" lower confidence-level low-VAF (≤ 12%) calls

Combine the nine NovaSeq replicates (380X)

Take advantage of the Genentech SPP data sets. Convert BWA BAM files to fastq, and then align them with Bowtie2 and NovoAlign (300X)

Recalibrate confidence-levels for low-VAF calls by the following rules:

# And equivalent for INDEL

seqc2/somaticseq/utilities/recalibrate_baseon_deepSeq.py \

-ref             GRCh38/GRCh38.d1.vd1.fa \

-infile          sSNV.MSDUKT.dedup_all.SPP.2ndPass.vcf.gz \

-outfile         sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue.vcf \

--bignova-bwa    BigNova.snv.bwa.vcf.gz \

--bignova-bowtie BigNova.snv.bowtie.vcf.gz \

--bignova-novo   BigNova.snv.novo.vcf.gz \

--spp-bwa        SPP300X.snv.bwa.vcf.gz \

--spp-bowtie     SPP300X.snv.bowtie.vcf.gz \

--spp-novo       SPP300X.snv.novo.vcf.gz

Find calls that are made only in those 300X and 380X data sets

# Output calls made only in those >300X data sets:

/somaticseq/utilities/discoverDeeperSeqCalls.py -deep DeeperSeqCalls/DeeperSeq.snv.vcf.gz -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue01.vcf.gz -ref GRCh38.d1.vd1.fa -toolString MSDUKT -outfile DeeperSeqOnly.snv.vcf


# Only keep those in consensus callable loci:

bedtools intersect -header -a DeeperSeqOnly.snv.vcf -b BED/genome.bed | bedtools intersect -header -a stdin -b BED/germline_chromosome_arm_loss.bed -v | bedtools intersect -header -a stdin -b BED/ConsensusCallableRegions.bed | uniq | bgzip > DeeperSeqOnly.snv.vcf.gz


# Annotate those >300X only calls with the same information

seqc2/somaticseq/utilities/DeeperSeqOnly_TsvAndVcf.py -vcfin DeeperSeqOnly.snv.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.DeeperSeq.tsv.gz -outfile DeeperSeqOnly.snv.toAdd.vcf

Add the calls that are called by DeepSeq data only into the super set

vcf-concat sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue01.vcf.gz DeeperSeqOnly.snv.toAdd.vcf.gz | seqc/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | bgzip > sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz

Step 11) Process NeuSomatic-E and NeuSomatic-S's VCF files

i) Convert VCF files in the same fashion as SomaticSeq's VCF files (like Step 4), so they can be merged with GATK CombineVariants (like Step 5)

for sample in EA_T_1.bowtie EA_T_1.bwa EA_T_1.novo FD_T_1.bowtie FD_T_1.bwa FD_T_1.novo FD_T_2.bowtie FD_T_2.bwa FD_T_2.novo FD_T_3.bowtie FD_T_3.bwa FD_T_3.novo IL_T_1.bowtie IL_T_1.bwa IL_T_1.novo IL_T_2.bowtie IL_T_2.bwa IL_T_2.novo IL_T_3.bowtie IL_T_3.bwa IL_T_3.novo LL_T_1.bowtie LL_T_1.bwa LL_T_1.novo NC_T_1.bowtie NC_T_1.bwa NC_T_1.novo NS_T_1.bowtie NS_T_1.bwa NS_T_1.novo NS_T_2.bowtie NS_T_2.bwa NS_T_2.novo NS_T_3.bowtie NS_T_3.bwa NS_T_3.novo NS_T_4.bowtie NS_T_4.bwa NS_T_4.novo NS_T_5.bowtie NS_T_5.bwa NS_T_5.novo NS_T_6.bowtie NS_T_6.bwa NS_T_6.novo NS_T_7.bowtie NS_T_7.bwa NS_T_7.novo NS_T_8.bowtie NS_T_8.bwa NS_T_8.novo NS_T_9.bowtie NS_T_9.bwa NS_T_9.novo NV_T_1.bowtie NV_T_1.bwa NV_T_1.novo NV_T_2.bowtie NV_T_2.bwa NV_T_2.novo NV_T_3.bowtie NV_T_3.bwa NV_T_3.novo

do

    seqc2/somaticseq/utilities/neusomatic/reformat_neuVcf.py -infile SEQC_wg1/E_WGS_${sample}.dedup/checkpoint_neusomatic_*_epoch*/output_.vcf -outfile NeuSomatic-E/VCFs/NeuSomatic.${sample}.vcf -tumor $sample

done

ii) Use GATK CombineVariants to make the NeuSomatic.All.vcf.gz file.

iii) Split it into SNV and INDEL separately

v3/somaticseq/vcfModifier/splitVcf.py -infile NeuSomatic-E/NeuSomatic.All.vcf.gz -snv NeuSomatic.sSNV.vcf -indel NeuSomatic.sINDEL.vcf

Step 12) Label SomaticSeq-based calls with NeuSomatic classification

SNV

seqc2/somaticseq/utilities/neusomatic/compare_Goldset_with_Neusomatic_E_S_calls.py -somaticseq sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz -nss NeuSomatic-S/NeuSomatic.sSNV.vcf.gz -nse NeuSomatic-E/NeuSomatic.sSNV.vcf.gz -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf -neuonly NeuSomaticOnly.sSNV.vcf


seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict NeuSomaticOnly.sSNV.vcf | bgzip > NeuSomaticOnly.sSNV.vcf.gz

INDEL

seqc2/somaticseq/utilities/SEQC2_postProcess_InDel_VCF.py -infile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.vcf


seqc2/somaticseq/utilities/neusomatic/compare_Goldset_with_Neusomatic_E_S_calls.py -somaticseq sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.vcf.gz -nss NeuSomatic-S/NeuSomatic.sINDEL.0.25.vcf.gz -nse NeuSomatic-E/NeuSomatic.sINDEL.vcf.gz -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.vcf -neuonly NeuSomaticOnly.sINDEL.vcf


seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict NeuSomaticOnly.sINDEL.vcf | bgzip > NeuSomaticOnly.sINDEL.vcf.gz

Step 13) Manual inspections for discrepant calls between SomaticSeq and NeuSomatic

seqc2/somaticseq/utilities/neusomatic/neusomatic_vs_somaticseq_discrepency_counter.py -infile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf.gz

Step 14) Re-label based on insights gained from NeuSomatic and SomaticSeq discrepancies

A few insights after manually looking at IGV alignments for calls that SomaticSeq and NeuSomatic disagree

seqc2/somaticseq/utilities/neusomatic/modify_confLabel.py -original sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf.gz -mod inspecting_discrepant_calls.xlsx -maxREJECS 2 -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf > SNV.labelChanged


seqc2/somaticseq/utilities/neusomatic/modify_confLabel.py -original sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.vcf.gz -mod inspecting_discrepant_calls.xlsx -maxREJECS 4 -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.reLabeled.vcf --promote-long-deletions > Indel.labelChanged

Step 15) Extract calls rescued by NeuSomatic (i.e., not found at all by SomaticSeq)

seqc2/somaticseq/utilities/neusomatic/reformat_neuOnlyVcf.py -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf.gz -neuVcf NeuSomaticOnly.sSNV.vcf.gz -neuMod inspecting_discrepant_calls.xlsx -outfile NeuRescued.sSNV.vcf


seqc2/somaticseq/utilities/neusomatic/reformat_neuOnlyVcf.py -gold sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.reLabeled.vcf.gz -neuVcf NeuSomaticOnly.sINDEL.vcf.gz -neuMod inspecting_discrepant_calls.xlsx -outfile NeuRescued.sINDEL.vcf

Step 16) Relabel and rescue calls in 1300X data sets

# Relabel based on stringent criteria, and output calls only in >1300X data sets:

seqc2/somaticseq/utilities/rescue_by_1300X.py -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf.gz -deeps NeuSomatic-S/NeuSomaticS.bwa_1000X_SNV.vcf.gz NeuSomatic-S/NeuSomaticS.novo_1000X_SNV.vcf.gz NeuSomatic-S/NeuSomaticS.bowtie_1000X_SNV.vcf.gz -callable BED/ConsensusCallableRegions.bed -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.1300x.vcf > Uniq.sSNV.1300X.vcf


# Sort

cat Uniq.sSNV.1300X.vcf | seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | bgzip > Uniq.sSNV.1300X.vcf.gz


seqc2/somaticseq/utilities/DeeperSeqOnly_TsvAndVcf.py -vcfin Uniq.sSNV.1300X.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.1000X.tsv.gz -outfile 1000X.snv.toAdd.vcf


# Jettison 1300X only calls where germline signal > 1% VAF in normal

cat 1000X.snv.toAdd.vcf | egrep '^#|;NVAF=0.00' | bgzip > 1000X.snv.toAdd.vcf.gz

Step 17) Combine to make final super set and high-confidence call set

vcf-concat NeuRescued.sSNV.vcf.gz 1000X.snv.toAdd.vcf.gz   sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.1300x.vcf.gz | seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | awk -F "\t" '{ if ( $7 ~ /HighConf|MedConf/ && $8 !~ /ArmLossInNormal/ ) $7="PASS;"$7 }'1 OFS='\t' | bgzip > sSNV.MSDUKT.superSet.v1.1.vcf.gz

Step 18) Extract sequencing information from PacBio for each variant call using a modern version of SomaticSeq

# Install SomaticSeq so somatic_vcf2tsv.py will be in your PATH

somatic_vcf2tsv.py -myvcf .sSNV.MSDUKT.superSet.v1.1.vcf.gz -nbam  HCC1395BL.SqII.bam -tbam  HCC1395.SqII.bam -ref   GRCh38.d1.vd1.fa -minMQ 1 -minBQ 0 -mincaller 0 -outfile sSNV.MSDUKT.superSet.v1.1.PACB_annotated.tsv


somatic_vcf2tsv.py -myvcf .sINDEL.MDKT.superSet.v1.1.vcf.gz -nbam  HCC1395BL.SqII.bam -tbam  HCC1395.SqII.bam -ref   GRCh38.d1.vd1.fa -minMQ 1 -minBQ 0 -mincaller 0 -outfile sINDEL.MDKT.superSet.v1.1.PACB_annotated.tsv

Step 19) Add linguistic sequence complexity

seqc2/somaticseq/utilities/add_linguistic_sequence_complexity_to_vcf.py -infile sSNV.MSDUKT.superSet.v1.1.vcf.gz -outfile sSNV.MSDUKT.superSet.v1.1a.vcf -reference GRCh38.d1.vd1.fa


seqc2/somaticseq/utilities/add_linguistic_sequence_complexity_to_vcf.py -infile sINDEL.MDKT.superSet.v1.1.vcf.gz -outfile sINDEL.MDKT.superSet.v1.1a.vcf -reference GRCh38.d1.vd1.fa


# Compress

bgzip -f sSNV.MSDUKT.superSet.v1.1a.vcf

bgzip -f sINDEL.MDKT.superSet.v1.1a.vcf

Step 20) Annotate with PacBio sequencing information* and remove low-MQ and/or low-LSC calls that are unsupported by PacBio

seqc2/somaticseq/utilities/attach_PacBioTsvDepth.py -myvcf sSNV.MSDUKT.superSet.v1.1a.vcf.gz -mytsv PacBio/sSNV.MSDUKT.superSet.v1.1.PACB_annotated.tsv.gz -outfile sSNV.MSDUKT.superSet.v1.2.vcf

seqc2/somaticseq/utilities/attach_PacBioTsvDepth.py -myvcf sINDEL.MDKT.superSet.v1.1a.vcf.gz -mytsv PacBio/sINDEL.MDKT.superSet.v1.1.PACB_annotated.tsv.gz -outfile sINDEL.MDKT.superSet.v1.2.vcf


# Compress

bgzip -f sSNV.MSDUKT.superSet.v1.2.vcf

bgzip -f sINDEL.MDKT.superSet.v1.2.vcf

Step 21) Simplify release VCFs that only include high-confidence calls in the high-confidence region

seqc2/somaticseq/utilities/makeSeqc2HighConfidenceCallSets/minimize_highconf_vcfs.py -infile sSNV.MSDUKT.superSet.v1.2.vcf.gz -outfile high-confidence_sSNV_in_HC_regions_v1.2.vcf

seqc2/somaticseq/utilities/makeSeqc2HighConfidenceCallSets/minimize_highconf_vcfs.py -infile sINDEL.MDKT.superSet.v1.2.vcf.gz -outfile high-confidence_sINDEL_in_HC_regions_v1.2.vcf


# Compress

bgzip high-confidence_sSNV_in_HC_regions_v1.2.vcf

bgzip high-confidence_sINDEL_in_HC_regions_v1.2.vcf