Metacallers & Infrastructures

SomaticSeq meta caller

SomaticSeq - General process

We use SAMtools [30] and GATK HaplotypeCaller on the tumor and normal BAM files to obtain a number of independent sequencing features that have predictive values for their somatic mutation statuses, e.g., mapping quality, base call quality, strand bias, depth of coverage, tail distance bias, etc. Some caller features, e.g., somatic mutation scores based on its distinct statistics, are also included. For the DREAM Challenge and real data, we also consider whether the site is in dbSNP. Two of the most important features in the adaptively boosted classifiers include the root-mean-square mapping quality score and the number of read mismatches compared to the reference.

For the results described in this study, we have used P≥0.7 as the cut-off for our SomaticSeq results, i.e., a candidate site of P≥0.7 is considered a PASS call, whereas a candidate site of P<0.7 is considered LowQual.
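The cut-off rule above can be sketched in a few lines. This is only an illustration of the thresholding step, not SomaticSeq code: the function name `label_call` and the example sites and probabilities are made up.

```python
PASS_CUTOFF = 0.7  # cut-off quoted in the notes above


def label_call(probability, cutoff=PASS_CUTOFF):
    """Return 'PASS' when P >= cutoff, otherwise 'LowQual'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return "PASS" if probability >= cutoff else "LowQual"


# Hypothetical candidate sites with made-up classifier probabilities.
candidates = {"chr1:12345": 0.93, "chr2:67890": 0.70, "chr3:13579": 0.42}
labels = {site: label_call(p) for site, p in candidates.items()}
```

A site exactly at P = 0.7 is labeled PASS because the rule is stated as P≥0.7.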

Since eight of the top 18 features relate directly to sequencing depth, it is important for the trained model to have a sequencing depth comparable to that of the target set. Thus, it would not be appropriate to use a 30× whole-genome sequence trained model to predict somatic mutations in a 500× targeted sequencing data set.

From http://bioinform.github.io/somaticseq/data.html

Intermediate files for the analyses were hosted on a Google Drive that we no longer own. We'll post the new URL when we find a new place to host them. We're aware the original links do not work anymore.
Analysis files, i.e., files generated during analysis, can be found here. The pre-built classifiers there were built based on version 1 described in the genome biology paper, which is no longer compatible with later versions because some metrics were added and some were removed. For v2 classifiers, try this instead.

Note: ‘here’ is a dead link, and ‘this’ points to the following dead link: https://drive.google.com/drive/folders/0B9pfRlnkG-Z7STNNczk4ak5xSmM

2) Trained on Stage 2, testing on variants of Stage 3 data

A) Tumor has three different VAFs (50 %, 33 %, and 20 %) representing three different subclones.

B) Mixed the normal and tumor data at a 95:5 ratio

C) Mixed the tumor and normal data at a 70:30 ratio

D) Normal from Setting B and tumor from Setting C
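As a rough illustration of what these mixing ratios do to allele fractions, the sketch below multiplies a VAF by the fraction of reads that come from the variant-carrying sample. It assumes simple read-level mixing and ignores copy number, purity, and sampling noise. Applying it to the three VAFs of Setting A is only an example input, not something the notes state was done.

```python
def diluted_vaf(vaf, variant_fraction):
    """Expected VAF after mixing.

    `variant_fraction` is the share of reads drawn from the sample that
    carries the variant; the remaining reads are assumed to be reference.
    """
    if not (0.0 <= vaf <= 1.0 and 0.0 <= variant_fraction <= 1.0):
        raise ValueError("vaf and variant_fraction must be in [0, 1]")
    return vaf * variant_fraction


subclone_vafs = [0.50, 0.33, 0.20]  # the three VAFs listed under Setting A

# Tumor diluted with normal reads at 70:30 (tumor share 0.70).
tumor_70_30 = [diluted_vaf(v, 0.70) for v in subclone_vafs]

# Normal contaminated with tumor reads at 95:5 (tumor share 0.05).
normal_95_5 = [diluted_vaf(v, 0.05) for v in subclone_vafs]
```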

Specific examinations

1) Train on Stage 2, test on straight Stage 3

Mixed Stage 2 tumor/normal data at a 70:30 ratio for training; test data was Stage 3.

Results were averaged over ten cross-validation results (the training set consists of half of the entire data set, randomly chosen), i.e., twofold cross-validation was performed ten times.
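The repeated twofold cross-validation described above can be sketched as follows. This uses only the Python standard library and generates index splits; the function name, the seed, and the data set size of 100 are illustrative assumptions, and the model training and scoring steps are left out.

```python
import random


def repeated_twofold_splits(n_items, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs.

    Each repeat shuffles the indices, cuts them in half, and uses each
    half once for training and once for testing.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        half = n_items // 2
        first, second = shuffled[:half], shuffled[half:]
        yield first, second
        yield second, first


# Ten repeats of twofold cross-validation give 20 train/test pairs.
splits = list(repeated_twofold_splits(n_items=100))
```

A per-split metric (for example an F1 score) would then be averaged over the splits.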

3) In Silico Titration

4) SomaticSpike

5) COLO-829, CLL1 trained on Stage 3 data

Mutect)

Used dbSNP v.138, COSMIC v.69, and a Panel of Normals based on Phase 1 of the 1kGP as resource files for the real sequencing data. COSMIC was not supplied for the DREAM Challenge, because the synthetic mutations were randomly chosen and not enriched in COSMIC sites. In the in silico titration and SomaticSpike experiments, none of these databases was used.

SomaticSniper) mapping quality cut-off 25, base quality cut-off 15, prior somatic mutation probability 10⁻⁴

VarScan2) mapping quality cut-off 25, base quality cut-off of 20.

JointSNVMix2) convergence threshold of 0.01 in training, somatic probability ≥0.95

VarDict) Relaxed the variant depth filter from 4 to 2 and the FET p-value cut-off from 0.05 to 0.15. Allowed each call to fail up to two of the 20 VarDict filters.
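To make the mapping-quality and base-quality cut-offs above concrete, here is a small sketch that records them in a dictionary and checks a single base against them. The dictionary keys and the function name are hypothetical, not options of either tool, and treating the cut-offs as inclusive minimums is an assumption.

```python
# Cut-offs copied from the notes above; the key names are hypothetical.
QUALITY_CUTOFFS = {
    "SomaticSniper": {"min_mapq": 25, "min_baseq": 15},
    "VarScan2": {"min_mapq": 25, "min_baseq": 20},
}


def base_passes(caller, mapq, baseq, cutoffs=QUALITY_CUTOFFS):
    """True when a base meets both quality cut-offs for the given caller.

    Assumption: a value equal to the cut-off passes.
    """
    limits = cutoffs[caller]
    return mapq >= limits["min_mapq"] and baseq >= limits["min_baseq"]


# A base with mapping quality 30 and base quality 17 would be usable under
# the SomaticSniper settings but not under the VarScan2 settings.
sniper_ok = base_passes("SomaticSniper", mapq=30, baseq=17)
varscan_ok = base_passes("VarScan2", mapq=30, baseq=17)
```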

Installation:

dbsnp common_all here
cosmic mutation data genome screens here

Looks difficult to work with. Testimony:

"I had been trying to operate SomaticSeq for 3 months before I switched to Bcbio. It s meant to be a great approach to mutation calling but it was so hard coded that I couldnt fix it to get it run for my own computer. "

Bcbio inclusion in SomaticSeq to exploit its decision tree seems difficult. There is a claim of a "more or less working" version of SomaticSeq that can include bcbio vcfs in its results. The dev asked Chapman, 10 days before the last commit to the modified SomaticSeq:

"I'm trying to make SomaticSeq work with the bcbio output vcf files at the moment, and the main issue seems to be converting the vcf data (+ some samtools/haplotypecaller output) into a nice tab delimited file for the ada stochastic boosting algorithm. The ada model builder/predictor R script is very simple compared to the previous steps. I was wondering if there are any tools in bcbio that could help. Like a magic vcf2tab function "

Chapman et al. reported having previously evaluated SomaticSeq for inclusion in bcbio and having decided against it, stating:

"Unfortunately we don't have any code written to help with this. It's still something we want to work on but after working through what it would take to implement, we realized it was a pretty large project and would take a big time investment. We'd definitely have interest in any work you do on integration. Thanks again for looking at this. "

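For reference, the "vcf2tab" step asked about above could look roughly like the sketch below. It is not a bcbio or SomaticSeq function: the name `vcf_to_tab`, the chosen columns, and the example records are assumptions. It handles only the fixed VCF columns plus selected INFO keys, and leaves out the samtools/HaplotypeCaller features SomaticSeq also needs.

```python
import csv
import io


def vcf_to_tab(vcf_lines, info_keys):
    """Flatten VCF records into tab-delimited text.

    Columns: CHROM, POS, REF, ALT, QUAL, then one column per requested
    INFO key. Missing keys are written as 'NA', flag-style keys as '1'.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["CHROM", "POS", "REF", "ALT", "QUAL"] + list(info_keys))
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _id, ref, alt, qual, _filter, info = line.split("\t")[:8]
        info_map = {}
        for entry in info.split(";"):
            if entry in ("", "."):
                continue
            key, sep, value = entry.partition("=")
            info_map[key] = value if sep else "1"
        row = [chrom, pos, ref, alt, qual]
        row += [info_map.get(key, "NA") for key in info_keys]
        writer.writerow(row)
    return out.getvalue()


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tT\t50\tPASS\tDP=30;MQ=58.2;SOMATIC",
    "2\t67890\t.\tG\tC\t12\tLowQual\tDP=8",
]
table = vcf_to_tab(example_vcf, ["DP", "MQ", "SOMATIC"])
```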
Using bcbio meta caller

Existing bcbio variant calling pipelines

Example study comparing combinations of assembly and variant calling pipelines, with explicit shell commands

Trio pipeline example with explicit shell commands

bcbio guidance on installing new variant callers. more guidance

advice for adding variant callers to bcbio

bcbio authors summarizing their attempts at incorporating popular callers

Our needs are not 100% aligned with theirs, but it would be nice if they took an interest in maintaining any callers we add.

bcbio discussion on including MuSE (Feb 2017)

VarScan is under maintenance (Mar 2017)

swi9 2017 rank-combination meta caller

source, study

Usage: /pathTo/bin/Rscript rank_combination.R combined_out_file.txt tool_1.vcf tool_2.vcf ... tool_n.vcf

VCF files should carry the variant caller's confidence score in the sixth column (QUAL); it is used for ranking.
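To show how a sixth-column score can drive a rank combination, here is a generic mean-rank aggregation sketch. It is not a port of rank_combination.R, whose exact method is not described in these notes. The function names, the tie handling, and the penalty rank for variants a caller did not report are all assumptions.

```python
from collections import defaultdict


def read_scores(vcf_lines):
    """Map (CHROM, POS, REF, ALT) to the float score in the sixth column."""
    scores = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        scores[(f[0], int(f[1]), f[3], f[4])] = float(f[5])
    return scores


def ranks_from_scores(scores):
    """Rank 1 is the highest score; tied scores share their average rank."""
    ordered = sorted(scores, key=lambda k: -scores[k])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for key in ordered[i:j + 1]:
            ranks[key] = shared
        i = j + 1
    return ranks


def combine_by_mean_rank(per_caller_scores):
    """Return (mean_rank, variant) pairs sorted from best to worst.

    Assumption: a caller that did not report a variant contributes a
    rank one worse than its last reported variant.
    """
    variants = set().union(*per_caller_scores)
    totals = defaultdict(float)
    for scores in per_caller_scores:
        ranks = ranks_from_scores(scores)
        missing_rank = len(scores) + 1
        for v in variants:
            totals[v] += ranks.get(v, missing_rank)
    n = len(per_caller_scores)
    return sorted((totals[v] / n, v) for v in variants)


# Made-up calls from two hypothetical tools.
tool_1 = read_scores(["1\t100\t.\tA\tT\t90", "1\t200\t.\tG\tC\t40", "2\t300\t.\tC\tT\t10"])
tool_2 = read_scores(["1\t100\t.\tA\tT\t60", "2\t300\t.\tC\tT\t80"])
combined = combine_by_mean_rank([tool_1, tool_2])
```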

ExScalibur 2015 SNV plural caller

Germline variant callers include GATK UnifiedGenotyper [15], GATK HaplotypeCaller [15], FreeBayes [16], SAMtools mpileup/bcftools [17], Isaac Variant Caller (IVC) [18] and Platypus [19]. Somatic variant callers include MuTect [20], Shimmer [21], SomaticSniper [22], Strelka [23], VarScan2 [24] and Virmid [25].

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135800

Lumpy 2014 SV meta caller

Lumpy can integrate multiple signals. If all of our callers have continuous variable confidence scores they should fit.

Cake 2013 SNV meta caller

from Wellcome Trust, Adams

Integrates four publicly available somatic variant-calling algorithms (Bambino, CaVEMan, SAMtools mpileup, and VarScan 2) to identify single nucleotide variants, with extra filtering. Follows a merge, consensus, filter model.
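The merge, consensus, filter idea can be sketched as below. This is a generic illustration, not Cake's implementation: the function names, the "at least two callers" threshold, the depth filter, and all example calls are assumptions.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers=2):
    """Merge per-caller call sets and keep variants seen by >= min_callers."""
    counts = Counter(v for calls in call_sets.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_callers}


def apply_depth_filter(variants, depth, min_depth=10):
    """Keep variants whose recorded depth reaches min_depth."""
    return {v for v in variants if depth.get(v, 0) >= min_depth}


# Made-up calls keyed by caller name; variants are (chrom, pos, ref, alt).
call_sets = {
    "Bambino": [("1", 100, "A", "T"), ("1", 200, "G", "C")],
    "CaVEMan": [("1", 100, "A", "T"), ("2", 300, "C", "T")],
    "mpileup": [("1", 100, "A", "T"), ("2", 300, "C", "T")],
    "VarScan2": [("3", 400, "T", "G")],
}
depth = {("1", 100, "A", "T"): 42, ("2", 300, "C", "T"): 6}

merged = consensus_calls(call_sets)          # seen by at least two callers
final = apply_depth_filter(merged, depth)    # post-consensus filtering
```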

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3740632/

http://cakesomatic.sourceforge.net/

CombineCalls(?) 2013

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4035752/
