Pipelines

GDC pipeline notes

GDC parameters

MuTect2 - GATK nightly-2016-02-25-gf39d340

The MuTect2 pipeline employs a "Panel of Normals" to identify additional germline mutations. This panel is generated using genomes from TCGA blood normal samples from thousands of individuals that were curated and confidently assessed to be cancer-free.

java -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R <reference> \
    -L <region> \
    -I:tumor <tumor.bam> \
    -I:normal <normal.bam> \
    --normal_panel <pon.vcf> \
    --cosmic <cosmic.vcf> \
    --dbsnp <dbsnp.vcf> \
    --contamination_fraction_to_filter 0.02 \
    -o <mutect_variants.vcf> \
    --output_mode EMIT_VARIANTS_ONLY \
    --disable_auto_index_creation_and_locking_when_reading_rods

MuSE - MuSEv1.0rc_submission_c039ffa

MuSE call \
    -f <reference> \
    -r <region> \
    <tumor.bam> \
    <normal.bam> \
    -O <intermediate_muse_call.txt>

MuSE sump \
    -I <intermediate_muse_call.txt> \
    -E \
    -D <dbsnp_known_snp_sites.vcf> \
    -O <muse_variants.vcf>

SomaticSniper v1.0.5.0

bam-somaticsniper \
    -q 0 \
    -Q 15 \
    -s 0.01 \
    -T 0.85 \
    -N 2 \
    -r 0.001 \
    -n NORMAL \
    -t TUMOR \
    -F vcf \
    -f ref.fa \
    <tumor.bam> \
    <normal.bam> \
    <somaticsniper_variants.vcf>

VarScan2

Mpileup; SAMtools 1.1

samtools mpileup \
    -f <reference> \
    -q 1 \
    -B \
    <normal.bam> \
    <tumor.bam> > <intermediate_mpileup.pileup>

VarScan Somatic; Varscan.v2.3.9

java -jar VarScan.jar somatic \
    <intermediate_mpileup.pileup> \
    <output_path> \
    --mpileup 1 \
    --min-coverage 8 \
    --min-coverage-normal 8 \
    --min-coverage-tumor 6 \
    --min-var-freq 0.10 \
    --min-freq-for-hom 0.75 \
    --normal-purity 1.0 \
    --tumor-purity 1.00 \
    --p-value 0.99 \
    --somatic-p-value 0.05 \
    --strand-filter 0 \
    --output-vcf

VarScan ProcessSomatic; Varscan.v2.3.9

java -jar VarScan.jar processSomatic \
    <intermediate_varscan_somatic.vcf> \
    --min-tumor-freq 0.10 \
    --max-normal-freq 0.05 \
    --p-value 0.07
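To make the three processSomatic thresholds above concrete, here is a minimal Python sketch of how they might combine into a single high-confidence test for a somatic call. This is an assumption about how the cut-offs interact, not VarScan's own code: the function names `parse_freq` and `is_high_confidence_somatic` are hypothetical, and the VarScan documentation remains the authority on the exact rule.

```python
# Thresholds copied from the processSomatic command in these notes.
MIN_TUMOR_FREQ = 0.10   # --min-tumor-freq
MAX_NORMAL_FREQ = 0.05  # --max-normal-freq
MAX_P_VALUE = 0.07      # --p-value


def parse_freq(text):
    """Convert a percentage string such as '12.5%' into a fraction."""
    return float(text.rstrip("%")) / 100.0


def is_high_confidence_somatic(tumor_freq, normal_freq, p_value):
    """Assumed rule: all three thresholds must be satisfied at once."""
    return (
        tumor_freq >= MIN_TUMOR_FREQ
        and normal_freq <= MAX_NORMAL_FREQ
        and p_value <= MAX_P_VALUE
    )


if __name__ == "__main__":
    # Hypothetical call: 25% variant frequency in tumor, 0% in normal.
    print(is_high_confidence_somatic(parse_freq("25%"), parse_freq("0%"), 0.001))
```

Under this reading, a call that misses any one of the three cut-offs would not be treated as high confidence.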

SomaticSeq pipeline notes

General process

We use SAMtools [30] and GATK HaplotypeCaller on the tumor and normal BAM files to obtain a number of independent sequencing features that are predictive of somatic mutation status, e.g., mapping quality, base call quality, strand bias, depth of coverage, and tail distance bias. Some caller-specific features, e.g., each caller's somatic mutation score based on its own statistics, are also included. For the DREAM Challenge and real data, we also consider whether the site is in dbSNP. Two of the most important features in the adaptively boosted classifiers are the root-mean-square mapping quality score and the number of read mismatches compared to the reference.

For the results described in this study, we have used P≥0.7 as the cut-off for our SomaticSeq results, i.e., a candidate site of P≥0.7 is considered a PASS call, whereas a candidate site of P<0.7 is considered LowQual.
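The cut-off described above amounts to a simple threshold on the classifier probability. A minimal Python sketch, assuming the probability is available as a float between 0 and 1; the function name `label_call` is hypothetical and is not part of SomaticSeq:

```python
PASS_THRESHOLD = 0.7  # cut-off stated in these notes


def label_call(probability):
    """Label a candidate site from its classifier probability P."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "PASS" if probability >= PASS_THRESHOLD else "LowQual"


if __name__ == "__main__":
    for p in (0.95, 0.7, 0.69):
        print(p, label_call(p))
```

Note that the boundary value P = 0.7 itself counts as a PASS call, because the cut-off is inclusive.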

Since eight of the top 18 features relate directly to sequencing depth, it is important for the trained model to have a sequencing depth comparable to that of the target set. Thus, it would not be appropriate to use a model trained on 30× whole-genome sequencing to predict somatic mutations in 500× targeted sequencing data.

From http://bioinform.github.io/somaticseq/data.html

Intermediate files for the analyses were hosted on a Google Drive that we no longer own. We'll post the new URL when we find a new place to host them. We're aware the original links do not work anymore.
Analysis files, i.e., files generated during analysis, can be found here. The pre-built classifiers there were built based on version 1 as described in the Genome Biology paper; they are no longer compatible with later versions because some metrics were added and some were removed. For v2 classifiers, try this instead.

In the quoted text, ‘here’ is a dead link, and ‘this’ points to the following link, which is also dead: https://drive.google.com/drive/folders/0B9pfRlnkG-Z7STNNczk4ak5xSmM

SomaticSeq Testing/Training Strategies

1) Train on Stage 2, test on straight Stage 3

Mixed Stage 2 tumor/normal data at a 70:30 ratio for training; the test data was Stage 3.

Results were averaged over ten rounds of twofold cross-validation (in each round, the training set was a randomly chosen half of the entire data set).
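The repeated twofold cross-validation can be sketched in Python as follows. This is only an illustration of the splitting and averaging scheme, not the original analysis code: `repeated_twofold_cv` and the `evaluate(train, test)` callback are hypothetical names, the callback stands in for whatever training and scoring procedure is used, and swapping the two halves within each round is an assumption based on the usual meaning of twofold cross-validation.

```python
import random
from statistics import mean


def repeated_twofold_cv(items, evaluate, repeats=10, seed=0):
    """Average a score over `repeats` rounds of twofold cross-validation.

    In each round the data are shuffled and split in half; each half
    serves once as the training set and once as the test set.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        scores.append(evaluate(first, second))  # train on first, test on second
        scores.append(evaluate(second, first))  # swap the roles
    return mean(scores)


if __name__ == "__main__":
    # Toy scorer: fraction of the data that ended up in the training half.
    def toy_score(train, test):
        return len(train) / (len(train) + len(test))

    print(repeated_twofold_cv(range(100), toy_score))
```

Fixing the seed keeps the random halves reproducible between runs, which helps when comparing classifiers on the same splits.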

2) Train on Stage 2, test on modified versions of Stage 3 data

A) Tumor has three different VAFs (50%, 33%, and 20%) representing three different subclones.
B) Mixed the normal and tumor data at a 95:5 ratio.
C) Mixed the tumor and normal data at a 70:30 ratio.
D) Normal from Setting B and tumor from Setting C.
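As a rough guide to what these mixtures do to the allele frequencies, the arithmetic can be sketched in Python. The sketch assumes that mixing is done by read count at comparable coverage, that the normal reads carry none of the somatic alleles, and that the three subclone VAFs from Setting A are the starting point; none of this is stated explicitly in the notes, and the function name `diluted_vaf` is hypothetical.

```python
SUBCLONE_VAFS = (0.50, 0.33, 0.20)  # VAFs listed under Setting A


def diluted_vaf(vaf, tumor_fraction):
    """Expected VAF after mixing tumor reads with variant-free normal reads."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    return vaf * tumor_fraction


if __name__ == "__main__":
    for label, tumor_fraction in (("tumor 70:30 mix", 0.70),
                                  ("normal 95:5 mix", 0.05)):
        expected = [round(diluted_vaf(v, tumor_fraction), 4) for v in SUBCLONE_VAFS]
        print(label, expected)
```

Under these assumptions, a 50% subclone would appear at about 35% in the 70:30 tumor mixture, and tumor alleles would appear at about 2.5% or less in the 95:5 normal mixture.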

3) In Silico Titration

4) SomaticSpike

5) COLO-829, CLL1 trained on Stage 3 data

SomaticSeq Caller Parameters

MuTect) Used dbSNP v.138, COSMIC v.69, and a Panel of Normals based on Phase 1 of the 1kGP as resource files for the real sequencing data. Did not supply COSMIC for the DREAM Challenge, because synthetic mutations were randomly chosen and not enriched in COSMIC sites. In our in silico titration and SomaticSpike experiments, none of these databases was used.

SomaticSniper) mapping quality cut-off 25, base quality cut-off 15, prior somatic mutation probability 10⁻⁴

VarScan2) mapping quality cut-off 25, base quality cut-off 20

JointSNVMix2) convergence threshold of 0.01 in training, somatic probability ≥0.95

VarDict) Relaxed the variant depth filter from 4 to 2 and the FET p-value cut-off from 0.05 to 0.15. Allowed each call to fail up to two of the 20 VarDict filters.
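The "fail up to two filters" rule for VarDict can be illustrated with a small Python sketch that counts the entries in a VCF FILTER field. It assumes that failed filters are reported as a semicolon-separated list in that field, as the VCF format allows; the function names and the filter names used in the example are hypothetical placeholders, not actual VarDict output.

```python
MAX_FAILED_FILTERS = 2  # tolerance stated in these notes


def failed_filters(filter_field):
    """Return the list of filter names recorded in a VCF FILTER field."""
    if filter_field in ("PASS", ".", ""):
        return []
    return [name for name in filter_field.split(";") if name]


def keep_call(filter_field, max_failed=MAX_FAILED_FILTERS):
    """Keep a call if it failed no more than `max_failed` filters."""
    return len(failed_filters(filter_field)) <= max_failed


if __name__ == "__main__":
    # Placeholder filter names, for illustration only.
    for field in ("PASS", "filterA;filterB", "filterA;filterB;filterC"):
        print(field, keep_call(field))
```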
