seqc2 - High-confidence Somatic SNV and INDEL v1.2

Building High-Confidence Mutation Call Sets v1.2

Release v1.2

Latest reference call set along with the high-confidence regions be obtained from the NCBI server.

All-Inclusive Call Sets

These two files includes high-confidence calls (labeled PASS in the FILTER column), as well as low-confidence and likely false positive calls. Note that the PASS label is different from the AllPASS label. AllPASS is an intermediate label that denotes variants that SomaticSeq has classified as a somatic mutation in each of the 63 tumor-normal BAM files. PASS label denotes high-confidence somatic calls that are deemed "true somatic mutations."

sSNV.MSDUKT.superSet.v1.2.vcf.gz
sINDEL.MDKT.superSet.v1.2.vcf.gz

High-Confidence Somatic Mutations

These files include only the high-confidence somatic mutation calls in the high-confidence regions. They can be used as truth sets along with the high-confidence regions file.

high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz
high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz

High-Confidence Regions

This file contains the regions of the genome where somatic mutations could be confidently determined. It excludes uncallable and unresolved genome regions. Variant calls outside these regions could not be confidently resolved.

High-Confidence_Regions_v1.2.bed

2021 MAQC Society 4th Annual Meeting

A 25-minute talk to describe how the reference call set was established on a high level

Improvements in v1.2 over v1.1

We used PacBio data to determine if our high-confidence somatic mutation calls are inconsistent with the orthogonal long-read technology. For high-VAF calls in low-mapping score (MQ<50) and low sequence complexity regions, we demoted their confidence level when they were unsupported by PacBio data (p < 0.05 for one-tailed two-proportion z-test when PacBio has no supporting read), resulting in the removal of 33 SNVs and 11 indels from the reference call set due to discordance from PacBio data.

Due to the relatively high error rate and low sequencing depth of our PacBio data, there are calls with zero variant read in PacBio. However, if p > 0.05 for these variant calls, they were not removed from the high-confidence call set, because we do not deem them to be inconsistent with PacBio data.

Steps to create the high-confidence call set v1.2

After the mutation calls are done, the steps to merge and consolidate results start in Step 4 below.
Many custom algorithms were created in the seqc2 branch of SomaticSeq. The v1.2 results can also be reproduced with seqc2_v1.2 of SomaticSeq (read the README.md) with the data_to_recreate_call_set_v1-2.tar.gz file on NCBI's site (also read the README in that .tar.gz file).

Step 1) Create SomaticSeq classifiers for each sequencing site (i.e., data group) that produced technical replicates

When there are sequencing replicates for the normal genome, one of them can be designated as the tumor. We used BamSurgeon to mutate reads in the designated tumor, and then run our SomaticSeq and NeuSomatic workflows where only the synthetic mutations were labeled as true positives.
This step was done locally at Roche Sequencing Solutions (RSS), and the results were uploaded to CGC.
- 15 sSNV SomaticSeq classifiers and 15 sINDEL SomaticSeq classifiers were created, i.e., for every combination of IL/NS/FD/NV and BWA/Bowtie2/NovoAlign (12), plus 3 more SNV and INDEL classifiers were created by combining all the training data (IL+NS+FD+NV) for each of BWA/Bowtie2/NovoAlign, and used them to classify EA/NC/LL.
- For NeuSomatic, we combined the data sets for each aligner to create three NeuSomatic-Ensemble (one for each of the three aligners) and three NeuSomatic-Standalone classifiers. NeuSomatic uses one single classifier for both SNV and INDEL.
For each classifier, bamsurgeon workflow was used to create ~100K synthetic SNVs and ~20K synthetic INDELs in replicated normals as those sites. As an example, the classifiers for Fudan data with BWA was created as follows:

1. Synthetic mutations were spiked into bwa.FD_N_2, and then SomaticSeq tumor-normal analysis (6 somatic mutation callers in MuTect2, SomaticSniper, VarDict, MuSE, Strelka, and TNscope) were run with bwa.FD_N_1 as the matched normal.
2. Synthetic mutations were spiked into bwa.FD_N_3, and then SomaticSeq tumor-normal analysis were run with bwa.FD_N_2 as the matched normal.
3. The two analyses above were combined to create SomaticSeq classifier (i.e., concatenating the Ensemble.sSNV.tsv files from the two analyses). The two classifiers trained from this data (one for sSNV and one for sINDEL) set is used to score/classify all three real BWA tumor-normal sample sets from Fudan.

The detailed documentation of all those runs are described in detail in the page: BAM Simulation pipeline in SomaticSeq, which not only describes our commands, but has link to our archived run scripts.

Files related to this step uploaded to CGC are in FDA_SEQC2_WG#1 projects with tags of "bamsurgeon."

Examples

bwa.tumorDesignate_FD_N_2.bam is the semi-synthetic tumor BAM file by spiking synthetic mutations into WGS_FD_N_2.bwa.dedup.bam.
bwa.tumorDesignate_FD_N_2.synthetic_snvs.vcf is the ground truth sSNV for the BAM file described above.
bwa.tumorDesignate_FD_N_2.synthetic_indels.leftAlign.vcf is the ground truth sINDEL for the BAM file described above.
bwa.FD.twoSets.MSDUKT.sSNV.tsv.ntChange.Classifier.RData is the sSNV SomaticSeq classifier.
bwa.FD.twoSets.MDKT.sINDEL.tsv.ntChange.Classifier.RData is the sINDEL SomaticSeq classifier.

Step 2) Run six somatic mutation callers for each of the 63 tumor-normal pairs on CGC

Example CGC Workflow to run MuTect2, SomaticSniper, VarDict, MuSE, and Strelka on a pair of tumor-normal data set:

https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/e9954c1b-57a8-4b5c-8228-7424b1f1438d

Example of CGC Workflow to run TNscope 201711.02:

https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/87524eee-6ba5-494c-9f18-f6a0f88a3a51

The details of somatic mutation workflows on CGC, and the location of the result output files are described in this page: Somatic Mutation Results on Seven Bridges Genomics.

Descriptions of each somatic mutatino caller on CGC workflow is described in this page: Somatic Mutation Tools on CGC.
This page showed "historical" progress made to successfully test each piece of tools/apps/workflows on CGC: Building workflow on Seven Bridges Genomics.

Step 3) Use SomaticSeq to classify mutation calls for each of the 63 call sets

This step was also done locally at Roche Sequencing. However, the results as well as working examples are on the CGC.

For each of the 54 data sets that came from a sequencing center with replicates, i.e., IL/NS/FD/NV with BWA/Bowtie/NovoAlign, the corresponding SomaticSeq classifier was used to score each variant. For EA/NC/LL data, the classifiers trained by combining IL/NS/FD/NV data sets were used.

An example on CGC: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7137edf0-c49c-4339-b948-d5c93e9b7b33

Documentations and files for archival purposes:

The commands/scripts to use SomaticSeq to combine the results of 6 callers for each of the 63 data sets: here.
The commands/scripts to invoke SomaticSeq classification as well as SEQC2 reformatting (described in step 4): here.

Step 4) Modify the VCF file format for each of the 63 call set, so they can be merged later

We moved useful items in INFO to sample columns, in order to preserve such information after combining the VCF files using GATK3 CombineVariants. The script is in the seqc2 branch of SomaticSeq.

Command used for VCF files that were classified by a SomaticSeq classifier:

seqc2/somaticseq/utilities/reformat_VCF2SEQC2.py \

-infile WGS.bwa.dedup-FD_T_1_vs_FD_N_1-SomaticSeq.sSNV.vcf \

-outfile reFormat.sSNV.predicted.by.bwa.IL1N_vs_IL2N.twoWay.vcf \

-callers MSDUKT \

-tumor FD_T_1.bwa \

-trained

-caller MSDUKT
- In the original VCF file, the INFO field contains a string such as MSDUKT=1,0,1,1,0,1 (M=MuTect, S=SomaticSniper, D=VarDict, U=MuSE, K=Strelka, T=TNscope) that denotes the callers that called the variant a somatic mutation. This field is moved to the sample field, so that when combined, this information is preserved for each data set. For INDEL, this string is MDKT.
-tumor FD_T_1.bwa
- To name sample name FD_T_1.bwa, so sample names of the same sequencing replicates do not clash when combining variants.

The commands/scripts to invoke SomaticSeq classification (step 3) as well as SEQC2 reformatting: here.

Step 5) Use GATK3 CombineVariants to merge 63 SNV VCF files and 63 INDEL VCF files

SNV and INDEL are done separately.
Reformatted VCF files were created for this purpose.
These are the two commands, one for SNV and one for INDEL. The following is for SNV.

java -Xmx32G -jar GenomeAnalysisTK.jar -T CombineVariants -R GRCh38.d1.vd1.fa --setKey null --genotypemergeoption UNSORTED \-V EATRIS/bowtie.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V EATRIS/bwa.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V EATRIS/novo.EA_T_1_vs_EA_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bowtie.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/bwa.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_1_vs_FD_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_2_vs_FD_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Fudan/novo.FD_T_3_vs_FD_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bowtie.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/bwa.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_1_vs_IL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_2_vs_IL_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.IL_T_3_vs_IL_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_1_vs_NS_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_2_vs_NS_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_3_vs_NS_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_4_vs_NS_N_4/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_5_vs_NS_N_5/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_6_vs_NS_N_6/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_7_vs_NS_N_7/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_8_vs_NS_N_8/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Illumina/novo.NS_T_9_vs_NS_N_9/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/bowtie.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/bwa.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V LLU/novo.LL_T_1_vs_LL_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/bowtie.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/bwa.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V NCI/novo.NC_T_1_vs_NC_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bowtie.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/bwa.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_1_vs_NV_N_1/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_2_vs_NV_N_2/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-V Novartis/novo.NV_T_3_vs_NV_N_3/somaticMutations/SomaticSeq_MSDUKT.v2.7.2/reFormat.sSNV.predicted.by.v2.7.2.vcf \-o sSNV.v2.8.1.combined.vcf

Step 6) Create an intermediate VCF file, preserving variant calls that has been labeled PASS in at least one out of the 63 sample sets

PASS means a SomaticSeq ≥ 0.7
Reduce number of total variant positions by keeping variant calls that have been deemed PASS in at least one sample set, e.g., tossing out variants who only made it this far due to REJECT or germline output by one caller in one sample. In other words, every variant call in the super set has at least some level of evidence support.
Initial tiers are assigned as described here.

Command:

# SNV

seqc2/somaticseq/utilities/highConfidenceBuilder.py -infile sSNV.v2.8.1.combined.vcf.gz -outfile sSNV.MSDUKT.dedup_all.vcf -ncallers 3 -all

# INDEL

seqc2/somaticseq/utilities/highConfidenceBuilder.py -infile sINDEL.v2.8.1.combined.vcf.gz -outfile sINDEL.MDKT.dedup_all.vcf -ncallers 2 -all

-ncallers argument (obsolete in v1.1 since we don't use caller consensus for somatic classifications)
- There are 6 somatic SNV callers, so 3 constitutes "majority-vote consensus."
- There are 4 somatic INDEL callers, so 2 or more constitutes "majority-vote consensus."

Step 7) For each variant position from the previous step, extract sequencing features from each of the 63 BAM files

# SNV (Equivalent for INDEL)seqc2/somaticseq/SSeq_vcf2tsv_multiPairBam.py \-myvcf sSNV.MSDUKT.dedup_all.vcf.gz \-nprefix IL_N_1.bwa IL_N_1.bowtie IL_N_1.novo IL_N_2.bwa IL_N_2.bowtie IL_N_2.novo IL_N_3.bwa IL_N_3.bowtie IL_N_3.novo NV_N_1.bwa NV_N_1.bowtie NV_N_1.novo NV_N_2.bwa NV_N_2.bowtie NV_N_2.novo NV_N_3.bwa NV_N_3.bowtie NV_N_3.novo FD_N_1.bwa FD_N_1.bowtie FD_N_1.novo FD_N_2.bwa FD_N_2.bowtie FD_N_2.novo FD_N_3.bwa FD_N_3.bowtie FD_N_3.novo NS_N_1.bwa NS_N_1.bowtie NS_N_1.novo NS_N_2.bwa NS_N_2.bowtie NS_N_2.novo NS_N_3.bwa NS_N_3.bowtie NS_N_3.novo NS_N_4.bwa NS_N_4.bowtie NS_N_4.novo NS_N_5.bwa NS_N_5.bowtie NS_N_5.novo NS_N_6.bwa NS_N_6.bowtie NS_N_6.novo NS_N_7.bwa NS_N_7.bowtie NS_N_7.novo NS_N_8.bwa NS_N_8.bowtie NS_N_8.novo NS_N_9.bwa NS_N_9.bowtie NS_N_9.novo EA_N_1.bwa EA_N_1.bowtie EA_N_1.novo NC_N_1.bwa NC_N_1.bowtie NC_N_1.novo LL_N_1.bwa LL_N_1.bowtie LL_N_1.novo \-tprefix IL_T_1.bwa IL_T_1.bowtie IL_T_1.novo IL_T_2.bwa IL_T_2.bowtie IL_T_2.novo IL_T_3.bwa IL_T_3.bowtie IL_T_3.novo NV_T_1.bwa NV_T_1.bowtie NV_T_1.novo NV_T_2.bwa NV_T_2.bowtie NV_T_2.novo NV_T_3.bwa NV_T_3.bowtie NV_T_3.novo FD_T_1.bwa FD_T_1.bowtie FD_T_1.novo FD_T_2.bwa FD_T_2.bowtie FD_T_2.novo FD_T_3.bwa FD_T_3.bowtie FD_T_3.novo NS_T_1.bwa NS_T_1.bowtie NS_T_1.novo NS_T_2.bwa NS_T_2.bowtie NS_T_2.novo NS_T_3.bwa NS_T_3.bowtie NS_T_3.novo NS_T_4.bwa NS_T_4.bowtie NS_T_4.novo NS_T_5.bwa NS_T_5.bowtie NS_T_5.novo NS_T_6.bwa NS_T_6.bowtie NS_T_6.novo NS_T_7.bwa NS_T_7.bowtie NS_T_7.novo NS_T_8.bwa NS_T_8.bowtie NS_T_8.novo NS_T_9.bwa NS_T_9.bowtie NS_T_9.novo EA_T_1.bwa EA_T_1.bowtie EA_T_1.novo NC_T_1.bwa NC_T_1.bowtie NC_T_1.novo LL_T_1.bwa LL_T_1.bowtie LL_T_1.novo \-nbam WGS_IL_N_1.bwa.dedup.bam WGS_IL_N_1.bowtie.dedup.bam WGS_IL_N_1.novo.dedup.bam WGS_IL_N_2.bwa.dedup.bam WGS_IL_N_2.bowtie.dedup.bam WGS_IL_N_2.novo.dedup.bam WGS_IL_N_3.bwa.dedup.bam WGS_IL_N_3.bowtie.dedup.bam WGS_IL_N_3.novo.dedup.bam WGS_NV_N_1.bwa.dedup.bam WGS_NV_N_1.bowtie.dedup.bam WGS_NV_N_1.novo.dedup.bam WGS_NV_N_2.bwa.dedup.bam WGS_NV_N_2.bowtie.dedup.bam WGS_NV_N_2.novo.dedup.bam WGS_NV_N_3.bwa.dedup.bam WGS_NV_N_3.bowtie.dedup.bam WGS_NV_N_3.novo.dedup.bam WGS_FD_N_1.bwa.dedup.bam WGS_FD_N_1.bowtie.dedup.bam WGS_FD_N_1.novo.dedup.bam WGS_FD_N_2.bwa.dedup.bam WGS_FD_N_2.bowtie.dedup.bam WGS_FD_N_2.novo.dedup.bam WGS_FD_N_3.bwa.dedup.bam WGS_FD_N_3.bowtie.dedup.bam WGS_FD_N_3.novo.dedup.bam WGS_NS_N_1.bwa.dedup.bam WGS_NS_N_1.bowtie.dedup.bam WGS_NS_N_1.novo.dedup.bam WGS_NS_N_2.bwa.dedup.bam WGS_NS_N_2.bowtie.dedup.bam WGS_NS_N_2.novo.dedup.bam WGS_NS_N_3.bwa.dedup.bam WGS_NS_N_3.bowtie.dedup.bam WGS_NS_N_3.novo.dedup.bam WGS_NS_N_4.bwa.dedup.bam WGS_NS_N_4.bowtie.dedup.bam WGS_NS_N_4.novo.dedup.bam WGS_NS_N_5.bwa.dedup.bam WGS_NS_N_5.bowtie.dedup.bam WGS_NS_N_5.novo.dedup.bam WGS_NS_N_6.bwa.dedup.bam WGS_NS_N_6.bowtie.dedup.bam WGS_NS_N_6.novo.dedup.bam WGS_NS_N_7.bwa.dedup.bam WGS_NS_N_7.bowtie.dedup.bam WGS_NS_N_7.novo.dedup.bam WGS_NS_N_8.bwa.dedup.bam WGS_NS_N_8.bowtie.dedup.bam WGS_NS_N_8.novo.dedup.bam WGS_NS_N_9.bwa.dedup.bam WGS_NS_N_9.bowtie.dedup.bam WGS_NS_N_9.novo.dedup.bam WGS_EA_N_1.bwa.dedup.bam WGS_EA_N_1.bowtie.dedup.bam WGS_EA_N_1.novo.dedup.bam WGS_NC_N_1.bwa.dedup.bam WGS_NC_N_1.bowtie.dedup.bam WGS_NC_N_1.novo.dedup.bam WGS_LL_N_1.bwa.dedup.bam WGS_LL_N_1.bowtie.dedup.bam WGS_LL_N_1.novo.dedup.bam \-tbam WGS_IL_T_1.bwa.dedup.bam WGS_IL_T_1.bowtie.dedup.bam WGS_IL_T_1.novo.dedup.bam WGS_IL_T_2.bwa.dedup.bam WGS_IL_T_2.bowtie.dedup.bam WGS_IL_T_2.novo.dedup.bam WGS_IL_T_3.bwa.dedup.bam WGS_IL_T_3.bowtie.dedup.bam WGS_IL_T_3.novo.dedup.bam WGS_NV_T_1.bwa.dedup.bam WGS_NV_T_1.bowtie.dedup.bam WGS_NV_T_1.novo.dedup.bam WGS_NV_T_2.bwa.dedup.bam WGS_NV_T_2.bowtie.dedup.bam WGS_NV_T_2.novo.dedup.bam WGS_NV_T_3.bwa.dedup.bam WGS_NV_T_3.bowtie.dedup.bam WGS_NV_T_3.novo.dedup.bam WGS_FD_T_1.bwa.dedup.bam WGS_FD_T_1.bowtie.dedup.bam WGS_FD_T_1.novo.dedup.bam WGS_FD_T_2.bwa.dedup.bam WGS_FD_T_2.bowtie.dedup.bam WGS_FD_T_2.novo.dedup.bam WGS_FD_T_3.bwa.dedup.bam WGS_FD_T_3.bowtie.dedup.bam WGS_FD_T_3.novo.dedup.bam WGS_NS_T_1.bwa.dedup.bam WGS_NS_T_1.bowtie.dedup.bam WGS_NS_T_1.novo.dedup.bam WGS_NS_T_2.bwa.dedup.bam WGS_NS_T_2.bowtie.dedup.bam WGS_NS_T_2.novo.dedup.bam WGS_NS_T_3.bwa.dedup.bam WGS_NS_T_3.bowtie.dedup.bam WGS_NS_T_3.novo.dedup.bam WGS_NS_T_4.bwa.dedup.bam WGS_NS_T_4.bowtie.dedup.bam WGS_NS_T_4.novo.dedup.bam WGS_NS_T_5.bwa.dedup.bam WGS_NS_T_5.bowtie.dedup.bam WGS_NS_T_5.novo.dedup.bam WGS_NS_T_6.bwa.dedup.bam WGS_NS_T_6.bowtie.dedup.bam WGS_NS_T_6.novo.dedup.bam WGS_NS_T_7.bwa.dedup.bam WGS_NS_T_7.bowtie.dedup.bam WGS_NS_T_7.novo.dedup.bam WGS_NS_T_8.bwa.dedup.bam WGS_NS_T_8.bowtie.dedup.bam WGS_NS_T_8.novo.dedup.bam WGS_NS_T_9.bwa.dedup.bam WGS_NS_T_9.bowtie.dedup.bam WGS_NS_T_9.novo.dedup.bam WGS_EA_T_1.bwa.dedup.bam WGS_EA_T_1.bowtie.dedup.bam WGS_EA_T_1.novo.dedup.bam WGS_NC_T_1.bwa.dedup.bam WGS_NC_T_1.bowtie.dedup.bam WGS_NC_T_1.novo.dedup.bam WGS_LL_T_1.bwa.dedup.bam WGS_LL_T_1.bowtie.dedup.bam WGS_LL_T_1.novo.dedup.bam \-ref GRCh38.d1.vd1.fa -dbsnp dbsnp_146.hg38.vcf.gz -cosmic COSMICv83.All.noSNP.vcf -inclusion _T_ -callers MSDUKT -dedup -outfile 63BAMs.sSNV.MSDUKT.dedup_all.tsv

Step 8) Check with tumor purity assessment data sets, and add a FLAG=inconsistentVAF if VAF does not move as expected

1) Create VAF files for each tumor purity data out of the 300X SPP data sets

SNV task on CGC:
- https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/d2ae793b-12b4-46cc-9be4-935e63fb1ba3/
INDEL task on CGC:
- https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/99da2fe2-6aa8-44ea-9dd0-8c73e63de89a

2) Make annotations/flags based on the VAF files created above, using the following command:

seqc2/somaticseq/utilities/titrationConsistencyTest.py \

-infile sSNV.MSDUKT.dedup_all.vcf.gz \

-vafs SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-0_mergeThree.bwa.dedup.txt \

SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_3-1_mergeThree.bwa.dedup.txt \

SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-1_mergeThree.bwa.dedup.txt \

SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_1-4_mergeThree.bwa.dedup.txt \

SPP/VAF.sSNV.MSDUKT.dedup_all_SPP_GT_0-1_mergeThree.bwa.dedup.txt \

-outfile sSNV.MSDUKT.dedup_all.SPP.vcf

seqc2/somaticseq/utilities/titrationConsistencyTest.py \

-infile sINDEL.MDKT.dedup_all.vcf.gz \

-vafs SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-0_mergeThree.bwa.dedup.txt \

SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_3-1_mergeThree.bwa.dedup.txt \

SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-1_mergeThree.bwa.dedup.txt \

SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_1-4_mergeThree.bwa.dedup.txt \

SPP/VAF.sINDEL.MDKT.dedup_all_SPP_GT_0-1_mergeThree.bwa.dedup.txt \

-outfile sINDEL.MDKT.dedup_all.SPP.vcf

Step 9) Annotate initial confidence-levels of mutation candidates based on initial tiering and the sequencing features extracted in Step 7.

This is the "first" attempt to annotate four confidence-levels (HighConf, MedConf, LowConf, and Unclassified), described here, based on SomaticSeq classification of the 63 WGS data sets.
The confidence level of some low-VAF calls will be refined and modified based on NeuSomatic classification and additional deep sequencing data later.
The file MajorityAlignerCallable.bed, was created as described here.

Commands

# Merge the regions in MajorityAlignerCallable.bed, to create fewer regions to speed up the next two scripts:

mergeBed -i MajorityAlignersCallable.bed | intersedBed -a stdin -b genome.bed > BED/ConsensusCallableRegions.bed

seqc2/somaticseq/utilities/highConfidenceBuilder_2ndPass.py -vcfin sSNV.MSDUKT.dedup_all.SPP.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.dedup_all.tsv.gz -callable BED/ConsensusCallableRegions.bed -exclude BED/germline_chromosome_arm_loss.bed -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.vcf -type snv

seqc2/somaticseq/utilities/highConfidenceBuilder_2ndPass.py -vcfin sINDEL.MDKT.dedup_all.SPP.vcf.gz -tsvin 63BAMs.sINDEL.MDKT.dedup_all.tsv.gz -callable BED/ConsensusCallableRegions.bed -exclude BED/germline_chromosome_arm_loss.bed -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.vcf -type indel

These files above have not been filtered for the 3 arm losses, but variants in them are flagged with "ArmLossInNormal" if they are. Variants outside ConsensusCallableRegions.bed are flagged with NonCallable in FLAGS.

Step 10) Use two sets of high-depth data to "rescue" lower confidence-level low-VAF (≤ 12%) calls

Combine the nine NovaSeq replicates (380X)

Uploaded to CGC, named singleSM.WGS_NS_(N|T)_combine9.(bwa|bowtie|novo).dedup.bam
- singleSM.WGS_NS_T_combine9.bwa.dedup.bam
- singleSM.WGS_NS_N_combine9.bwa.dedup.bam
- singleSM.WGS_NS_T_combine9.bowtie.dedup.bam
- singleSM.WGS_NS_N_combine9.bowtie.dedup.bam
- singleSM.WGS_NS_T_combine9.novo.dedup.bam
- singleSM.WGS_NS_N_combine9.novo.dedup.bam
- Somatic Mutation Calling results: CombineNovaSeq.zip
- SomaticSeq classifiers used: combinedNovaSeqClassifiers.zip

Take advantage of the Genentech SPP data sets. Convert BWA BAM files to fastq, and then align them with Bowtie2 and NovoAlign (300X)

Convert BAM files to fastq files based on read groups, and then align using bwa, bowtie2, and novoalign
- Genentech.300X.Tumor.bwa.dedup.bam
- Genentech.300X.Normal.bwa.dedup.bam
- Genentech.300X.Tumor.bowtie.dedup.bam
- Genentech.300X.Normal.bowtie.dedup.bam
- Genentech.300X.Tumor.novo.dedup.bam
- Genentech.300X.Normal.novo.dedup.bam
- Somatic Mutation Calling results: SPP.300X.zip
- SomaticSeq classifiers used: SPP.300X.Classifiers.zip

Recalibrate confidence-levels for low-VAF calls by the following rules:

- Promote LowConf and MedConf calls to HighConf if there is zero REJECT classifications by SomaticSeq (i.e., variants not classified as PASS because there is no read in the sample, and not due to a called variant classified as REJECT), and if the variant is called PASS in all six >300X data sets..
- Promote Unclassified and LowConf calls to MedConf if the variant is called PASS in at least ≥4 of the six >300X data sets with no REJECT and represented by all three aligners (i.e., if the variant is called PASS in BWA and NovoAlign in combined NovaSeq, and called PASS in Bowtie2 and BWA in SPP data, with no REJECT or Missing classifications, it would have met this requirement).
- Promote Unclassified to LowConf, and LowConf to MedConf if there is ≥1 PASS in combined NovaSeq, ≥1 PASS in SPP, ≥1 PASS in bwa, ≥1 PASS in NovoAlign, and ≥1 PASS in Bowtie2, with no REJECT or Missing classifications in any samples (LowQual classifications are tolerable).
- Promote Unclassified to LowConf if there is ≥1 PASS classification
- Demote MedConf to LowConf if there is only REJECT and Missing but not PASS classification.

# And equivalent for INDEL

seqc2/somaticseq/utilities/recalibrate_baseon_deepSeq.py \

-ref GRCh38/GRCh38.d1.vd1.fa \

-infile sSNV.MSDUKT.dedup_all.SPP.2ndPass.vcf.gz \

-outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue.vcf \

--bignova-bwa BigNova.snv.bwa.vcf.gz \

--bignova-bowtie BigNova.snv.bowtie.vcf.gz \

--bignova-novo BigNova.snv.novo.vcf.gz \

--spp-bwa SPP300X.snv.bwa.vcf.gz \

--spp-bowtie SPP300X.snv.bowtie.vcf.gz \

--spp-novo SPP300X.snv.novo.vcf.gz

Find calls that are made only in those 300X and 380X data sets

- Criteria were set to label MedConf and HighConf calls at this step. However, none fulfilled such criteria.
- Only LowConf calls were found here, which met the threshold of ≥2 PASS out of 6.

# Output calls made only in those >300X data sets:

/somaticseq/utilities/discoverDeeperSeqCalls.py -deep DeeperSeqCalls/DeeperSeq.snv.vcf.gz -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue01.vcf.gz -ref GRCh38.d1.vd1.fa -toolString MSDUKT -outfile DeeperSeqOnly.snv.vcf

# Only keep those in consensus callable loci:

bedtools intersect -header -a DeeperSeqOnly.snv.vcf -b BED/genome.bed | bedtools intersect -header -a stdin -b BED/germline_chromosome_arm_loss.bed -v | bedtools intersect -header -a stdin -b BED/ConsensusCallableRegions.bed | uniq | bgzip > DeeperSeqOnly.snv.vcf.gz

# Annotate those >300X only calls with the same information

seqc2/somaticseq/utilities/DeeperSeqOnly_TsvAndVcf.py -vcfin DeeperSeqOnly.snv.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.DeeperSeq.tsv.gz -outfile DeeperSeqOnly.snv.toAdd.vcf

Add the calls that are called by DeepSeq data only into the super set

vcf-concat sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescue01.vcf.gz DeeperSeqOnly.snv.toAdd.vcf.gz | seqc/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | bgzip > sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz

Step 11) Process NeuSomatic-E and NeuSomatic-S's VCF files

i) Convert VCF files in the same fashion as SomaticSeq's VCF files (like Step 4), so they can be merged with GATK CombineVariants (like Step 5)

for sample in EA_T_1.bowtie EA_T_1.bwa EA_T_1.novo FD_T_1.bowtie FD_T_1.bwa FD_T_1.novo FD_T_2.bowtie FD_T_2.bwa FD_T_2.novo FD_T_3.bowtie FD_T_3.bwa FD_T_3.novo IL_T_1.bowtie IL_T_1.bwa IL_T_1.novo IL_T_2.bowtie IL_T_2.bwa IL_T_2.novo IL_T_3.bowtie IL_T_3.bwa IL_T_3.novo LL_T_1.bowtie LL_T_1.bwa LL_T_1.novo NC_T_1.bowtie NC_T_1.bwa NC_T_1.novo NS_T_1.bowtie NS_T_1.bwa NS_T_1.novo NS_T_2.bowtie NS_T_2.bwa NS_T_2.novo NS_T_3.bowtie NS_T_3.bwa NS_T_3.novo NS_T_4.bowtie NS_T_4.bwa NS_T_4.novo NS_T_5.bowtie NS_T_5.bwa NS_T_5.novo NS_T_6.bowtie NS_T_6.bwa NS_T_6.novo NS_T_7.bowtie NS_T_7.bwa NS_T_7.novo NS_T_8.bowtie NS_T_8.bwa NS_T_8.novo NS_T_9.bowtie NS_T_9.bwa NS_T_9.novo NV_T_1.bowtie NV_T_1.bwa NV_T_1.novo NV_T_2.bowtie NV_T_2.bwa NV_T_2.novo NV_T_3.bowtie NV_T_3.bwa NV_T_3.novo

seqc2/somaticseq/utilities/neusomatic/reformat_neuVcf.py -infile SEQC_wg1/E_WGS_${sample}.dedup/checkpoint_neusomatic_*_epoch*/output_.vcf -outfile NeuSomatic-E/VCFs/NeuSomatic.${sample}.vcf -tumor $sample

done

ii) Use GATK CombineVariants to make the NeuSomatic.All.vcf.gz file.

iii) Split it into SNV and INDEL separately

v3/somaticseq/vcfModifier/splitVcf.py -infile NeuSomatic-E/NeuSomatic.All.vcf.gz -snv NeuSomatic.sSNV.vcf -indel NeuSomatic.sINDEL.vcf

Step 12) Label SomaticSeq-based calls with NeuSomatic classification

SNV

seqc2/somaticseq/utilities/neusomatic/compare_Goldset_with_Neusomatic_E_S_calls.py -somaticseq sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz -nss NeuSomatic-S/NeuSomatic.sSNV.vcf.gz -nse NeuSomatic-E/NeuSomatic.sSNV.vcf.gz -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf -neuonly NeuSomaticOnly.sSNV.vcf

seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict NeuSomaticOnly.sSNV.vcf | bgzip > NeuSomaticOnly.sSNV.vcf.gz

INDEL

seqc2/somaticseq/utilities/SEQC2_postProcess_InDel_VCF.py -infile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.vcf.gz -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.vcf

seqc2/somaticseq/utilities/neusomatic/compare_Goldset_with_Neusomatic_E_S_calls.py -somaticseq sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.vcf.gz -nss NeuSomatic-S/NeuSomatic.sINDEL.0.25.vcf.gz -nse NeuSomatic-E/NeuSomatic.sINDEL.vcf.gz -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.vcf -neuonly NeuSomaticOnly.sINDEL.vcf

seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict NeuSomaticOnly.sINDEL.vcf | bgzip > NeuSomaticOnly.sINDEL.vcf.gz

Step 13) Manual inspections for discrepant calls between SomaticSeq and NeuSomatic

Use Fisher's Exact Test to print out variant calls where SomaticSeq and NeuSomatic calls have shown major discrepancies. We manually inspected them with IGV, and demote them out of the high-confidence set if the alignments are suspicious. We also promoted a few into the high-confidence set if the alignments are clean.

seqc2/somaticseq/utilities/neusomatic/neusomatic_vs_somaticseq_discrepency_counter.py -infile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf.gz

Step 14) Re-label based on insights gained from NeuSomatic and SomaticSeq discrepancies

A few insights after manually looking at IGV alignments for calls that SomaticSeq and NeuSomatic disagree

SomaticSeq and NeuSomatic-E classifications are extremely correlated. SomaticSeq and NeuSomatic-S are less correlated, relatively speaking. We manually inspected calls where SomaticSeq and NeuSomatic-E&S showed significant discrepancies, and manually relabeled those calls. Many of them were "downgraded" to LowQual when we weren't so sure upon IGV visualization. We additionally inspected hundreds of calls where SomaticSeq and NeuSomatic-S showed differences (i.e., HighConf calls with NeuSomaticS<20), and manually changed some of the confidence level labels as well.
Calls with a large number of REJECT classifications (nREJECTS) instead of simply "not called" in many samples/replicates are more likely to be actual false positives, whereas the latter tend to be low VAF variants that do not appear in all the samples. So, we decided to be conservative including variant calls in the truth set, and "demoted" HighConf and MedConf SNV calls with nREJECTS>2 to LowQual.
For Unclassified SNV calls with either NeuSomatic-S or NeuSomatic-E greater than 30, and with nPASSES > nREJECTS, they are re-labeled LowConf, because many of them are in very low MQ regions, or very messy regions. That are likely false positives, but in regions where we just can't be sure of anything with short read technologies.

seqc2/somaticseq/utilities/neusomatic/modify_confLabel.py -original sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.vcf.gz -mod inspecting_discrepant_calls.xlsx -maxREJECS 2 -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf > SNV.labelChanged

seqc2/somaticseq/utilities/neusomatic/modify_confLabel.py -original sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.vcf.gz -mod inspecting_discrepant_calls.xlsx -maxREJECS 4 -outfile sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.reLabeled.vcf --promote-long-deletions > Indel.labelChanged

Step 15) Extract calls rescued by NeuSomatic (i.e., not found at all by SomaticSeq)

- 8 LowConf SNVs and 6 LowConf Indels were added to the super set

seqc2/somaticseq/utilities/neusomatic/reformat_neuOnlyVcf.py -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf.gz -neuVcf NeuSomaticOnly.sSNV.vcf.gz -neuMod inspecting_discrepant_calls.xlsx -outfile NeuRescued.sSNV.vcf

seqc2/somaticseq/utilities/neusomatic/reformat_neuOnlyVcf.py -gold sINDEL.MDKT.dedup_all.SPP.2ndPass.lowVafRescued.properIndels.Neu.reLabeled.vcf.gz -neuVcf NeuSomaticOnly.sINDEL.vcf.gz -neuMod inspecting_discrepant_calls.xlsx -outfile NeuRescued.sINDEL.vcf

Step 16) Relabel and rescue calls in 1300X data sets

# Relabel based on stringent criteria, and output calls only in >1300X data sets:

seqc2/somaticseq/utilities/rescue_by_1300X.py -gold sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.vcf.gz -deeps NeuSomatic-S/NeuSomaticS.bwa_1000X_SNV.vcf.gz NeuSomatic-S/NeuSomaticS.novo_1000X_SNV.vcf.gz NeuSomatic-S/NeuSomaticS.bowtie_1000X_SNV.vcf.gz -callable BED/ConsensusCallableRegions.bed -outfile sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.1300x.vcf > Uniq.sSNV.1300X.vcf

# Sort

cat Uniq.sSNV.1300X.vcf | seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | bgzip > Uniq.sSNV.1300X.vcf.gz

seqc2/somaticseq/utilities/DeeperSeqOnly_TsvAndVcf.py -vcfin Uniq.sSNV.1300X.vcf.gz -tsvin 63BAMs.sSNV.MSDUKT.1000X.tsv.gz -outfile 1000X.snv.toAdd.vcf

# Jettison 1300X only calls where germline signal > 1% VAF in normal

cat 1000X.snv.toAdd.vcf | egrep '^#|;NVAF=0.00' | bgzip > 1000X.snv.toAdd.vcf.gz

Step 17) Combine to make final super set and high-confidence call set

Label PASS for those deemed as high-confidence calls

vcf-concat NeuRescued.sSNV.vcf.gz 1000X.snv.toAdd.vcf.gz sSNV.MSDUKT.dedup_all.SPP.2ndPass.lowVafRescued.Neu.reLabeled.1300x.vcf.gz | seqc2/somaticseq/utilities/vcfsorter.pl GRCh38.d1.vd1.dict - | awk -F "\t" '{ if ( $7 ~ /HighConf|MedConf/ && $8 !~ /ArmLossInNormal/ ) $7="PASS;"$7 }'1 OFS='\t' | bgzip > sSNV.MSDUKT.superSet.v1.1.vcf.gz

Step 18) Extract sequencing information from PacBio for each variant call using a modern version of SomaticSeq

# Install SomaticSeq so somatic_vcf2tsv.py will be in your PATH

somatic_vcf2tsv.py -myvcf .sSNV.MSDUKT.superSet.v1.1.vcf.gz -nbam HCC1395BL.SqII.bam -tbam HCC1395.SqII.bam -ref GRCh38.d1.vd1.fa -minMQ 1 -minBQ 0 -mincaller 0 -outfile sSNV.MSDUKT.superSet.v1.1.PACB_annotated.tsv

somatic_vcf2tsv.py -myvcf .sINDEL.MDKT.superSet.v1.1.vcf.gz -nbam HCC1395BL.SqII.bam -tbam HCC1395.SqII.bam -ref GRCh38.d1.vd1.fa -minMQ 1 -minBQ 0 -mincaller 0 -outfile sINDEL.MDKT.superSet.v1.1.PACB_annotated.tsv

Step 19) Add linguistic sequence complexity

Phred-scaled Linguistic Sequence Complexity (LSC) within 80-bp window of the mutation site, that calculates subsequence up to 20 bps.

seqc2/somaticseq/utilities/add_linguistic_sequence_complexity_to_vcf.py -infile sSNV.MSDUKT.superSet.v1.1.vcf.gz -outfile sSNV.MSDUKT.superSet.v1.1a.vcf -reference GRCh38.d1.vd1.fa

seqc2/somaticseq/utilities/add_linguistic_sequence_complexity_to_vcf.py -infile sINDEL.MDKT.superSet.v1.1.vcf.gz -outfile sINDEL.MDKT.superSet.v1.1a.vcf -reference GRCh38.d1.vd1.fa

# Compress

bgzip -f sSNV.MSDUKT.superSet.v1.1a.vcf

bgzip -f sINDEL.MDKT.superSet.v1.1a.vcf

Step 20) Annotate with PacBio sequencing information* and remove low-MQ and/or low-LSC calls that are unsupported by PacBio

sSNV.MSDUKT.superSet.v1.2.vcf.gz and sINDEL.MDKT.superSet.v1.2.vcf.gz are the all-inclusive somatic mutation call sets.

seqc2/somaticseq/utilities/attach_PacBioTsvDepth.py -myvcf sSNV.MSDUKT.superSet.v1.1a.vcf.gz -mytsv PacBio/sSNV.MSDUKT.superSet.v1.1.PACB_annotated.tsv.gz -outfile sSNV.MSDUKT.superSet.v1.2.vcf

seqc2/somaticseq/utilities/attach_PacBioTsvDepth.py -myvcf sINDEL.MDKT.superSet.v1.1a.vcf.gz -mytsv PacBio/sINDEL.MDKT.superSet.v1.1.PACB_annotated.tsv.gz -outfile sINDEL.MDKT.superSet.v1.2.vcf

# Compress

bgzip -f sSNV.MSDUKT.superSet.v1.2.vcf

bgzip -f sINDEL.MDKT.superSet.v1.2.vcf

Step 21) Simplify release VCFs that only include high-confidence calls in the high-confidence region

high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz and high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz

seqc2/somaticseq/utilities/makeSeqc2HighConfidenceCallSets/minimize_highconf_vcfs.py -infile sSNV.MSDUKT.superSet.v1.2.vcf.gz -outfile high-confidence_sSNV_in_HC_regions_v1.2.vcf

seqc2/somaticseq/utilities/makeSeqc2HighConfidenceCallSets/minimize_highconf_vcfs.py -infile sINDEL.MDKT.superSet.v1.2.vcf.gz -outfile high-confidence_sINDEL_in_HC_regions_v1.2.vcf

# Compress

bgzip high-confidence_sSNV_in_HC_regions_v1.2.vcf

bgzip high-confidence_sINDEL_in_HC_regions_v1.2.vcf