Orthogonal Validations

For validation "calls," we did not implement any mutation callers.

Mutation callers were used on the data sets that created the "gold set," so validation procedure calls for a different prior. Thus, simple rules were followed that reflect the following criteria:

The exact thresholds for the rules are different for the two data sets, because the AmpliSeq is ~ 2000X in depth, and the Ion Torrent data is ~ 34 for tumor and ~ 47 for normal. 

AmpliSeq on NextSeq 500 (2000X)

Three replicates of tumor and normal were sequenced, each at ~ 700X in depth.

Each pair of .fastq.gz files were aligned with BWA MEM as follows (no mark duplicate because of PCR enrichment)

docker run --rm -u $UID -v /:/mnt lethalfang/bwa:0.7.17_samtools bash -c \

"bwa mem \

-R '@RG\tID:AmpliSeqValidation\tLB:AmpliSeq\tPL:illumina\tSM:HCC1395' \

-M -t 12 \

/mnt/SEQC2_Resources/GRCh38.d1.vd1.fa \

/mnt/AmpliSeq_Validation/Breast-Carcinoma-SC-CRL-2324-1_180531_M04643_0036_000000000-BTW7M_L001_R1.fastq.gz \

/mnt/AmpliSeq_Validation/Breast-Carcinoma-SC-CRL-2324-1_180531_M04643_0036_000000000-BTW7M_L001_R2.fastq.gz \

| samtools view -Sbh - | samtools sort -m 4G --threads 12 -o /mnt/AmpliSeq_Validation/HCC1395_1.bam" 




Figure 1

AmpliSeq validation of the v1.0rc showed that lower-confidence calls have a higher probability of getting positively validated for low-VAF variants, along with other simple metrics, i.e., number of PASS and REJECT samples. 

Based on that knowledge, and additional higher-depth data sets, confidence-levels were re-calibrated for low-VAF lower-confidence level calls. 

All the material you need to reproduce AmpliSeq validation (except for the BAM files) are archived here. The annotation files are AmpliSeq.SNV.ValidationAnnotation.tsv and AmpliSeq.INDEL.ValidationAnnotation.tsv. 

Ion Torrent WES on Ion S5 XL

Align and mark duplicates as follows for both tumor and normal

docker run --rm -u $UID -v /sc1:/sc1 -t lethalfang/tmap:3.0.1 \

/TMAP/tmap mapall -n 24 \

-f /sc1/SEQC2_Resources/GRCh38.d1.vd1.fa \

-r /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.fastq.gz \

-v -Y -u \

-s /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.bam \

-o 1 stage1 map4


samtools sort -m 4G --threads 4 -o /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.sort.bam /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.bam


docker run --rm -v /:/mnt -u $UID lethalfang/picard:2.18.4 \

java -Xmx16g -jar /opt/picard.jar MarkDuplicates \

I=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.bam \

M=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.dedup \

O=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.dedup.bam \

CREATE_INDEX=true ASSUME_SORTED=true \

TMP_DIR=/mnt/IonTorrent/2018-07-31_15-04-44_508063456.temp

Agilent SureSelect v6 Exon UTR was used for capture. We downloaded the hg19 bed file, and lifted it over to GRCh38 at https://genome.ucsc.edu/cgi-bin/hgLiftOver. And then post-process the output BED file as follows:

cat hglft_genome_17c3_e7160.bed | egrep -v '_' > hglft_genome_17c3_e7160.majors.bed

bedtools sort -faidx GRCh38.d1.vd1.fa.fai -i hglft_genome_17c3_e7160.majors.bed > hglft_genome_17c3_e7160.majors.sort.bed

mergeBed -i hglft_genome_17c3_e7160.majors.sort.bed > hglft_genome_17c3_e7160.majors.sort.merged.GRCh38.bed

All the material you need to reproduce AmpliSeq validation (except for the BAM files) are archived here. The annotation files are IonTorrent.SNV.ValidationAnnotation.tsv and IonTorrent.INDEL.ValidationAnnotation.tsv.

Code to plot 95% binomial distribution confidence interval

from scipy.stats import beta

import math

def binom_interval(success, total, confint=0.95):

    quantile = (1 - confint) / 2

    lower = beta.ppf(quantile, success, total - success + 1)

    upper = beta.ppf(1 - quantile, success + 1, total - success)

    if math.isnan(lower): lower = 0

    if math.isnan(upper): upper = 1

    return (lower, upper)


# For 5% VAF at DP=100, the 95% interval is:

lowerVAF, upperVAF = binom_interval(5, 100)