Identify high-confidence Regions
GATK CallableLoci
We used GATK CallableLoci to identify callable regions in each of the 63 tumor-normal BAM files (i.e., 126 BAM files).
GATK CallableLoci parameters:
Use 8x the average coverage for each sample (as calculated by GATK DepthofCoverage) as the maxDepth
Minimum mapping quality is 20, same as GiaB settings.
For each Site and Aligner, also do the same routine for the normal samples, and use the intersection for each "replicate"
The "consensus callable regions" are MajorityAlignersCallable.bed and mergeBed.MajorityAlignersCallable.bed (which is just mergeBed -i MajorityAlignersCallable.bed).
An example GATK3 command used in those tasks are
docker run --rm -u 83748 broadinstitute/gatk3:3.8-0 \
java -Xmx8g -jar GenomeAnalysisTK.jar \
-T CallableLoci \
-R /PATH/TO/GRCh38.d1.vd1.fa \
-I /PATH/TO/WGS_IL_N_3.bwa.dedup.bam \
--maxDepth 376 \
--maxFractionOfReadsWithLowMAPQ 0.1 \
--maxLowMAPQ 1 \
--minBaseQuality 20 \
--minMappingQuality 20 \
--minDepth 10 \
--minDepthForLowMAPQ 10 \
--summary /PATH/TO/WGS_IL_N_3.bwa.summary \
-o /PATH/TO/WGS_IL_N_3.bwa.Callable.bed
Count regions that are reproducibly callable
We used a SomaticSeq utility to count and identify common loci among the 63 tumor's WGS_SITE_T_REPLICATE.ALIGNER.Callable.bed with the following scheme. This utility will keep track and count regions that have appeared in a BED file (i.e., GATK CallableLoci's output), i.e., an example command:
lociCounterWithLabels.py \
-fai genome.fa.fai \
-beds replicate1.bed replicate2.bed replicate3.bed \
-labels Rep1 Rep2 Rep3
Imagine a contigX in genome.fa has a length of 1000, and the 3 bed files have the following content:
# replicate1.bed:
contigX 100 300
contigX 400 600
# replicate2.bed
contigX 80 320
contigX 400 550
# replicate3.bed
contigX 90 310
contigX 420 700
It will produce the following output, where the 4th column keeps track of how many times this region has been counted from the bed files, and the 5th column denotes the corresponding BED file (or its label) that overlaps this region:
X 0 80 0 .
X 80 90 1 Rep2
X 90 100 2 Rep2,Rep3
X 100 300 3 Rep1,Rep2,Rep3
X 300 310 2 Rep2,Rep3
X 310 320 1 Rep2
X 320 400 0 .
X 400 420 2 Rep1,Rep2
X 420 550 3 Rep1,Rep2,Rep3
X 550 600 2 Rep1,Rep3
X 600 700 1 Rep3
X 700 1000 0 .
Briefly as follows
1) For each sequencing centers and each aligner, we keep the callable loci where the majority of the data sets returned the region. An example file name output for this step is IL.bwa.MajorityCallable.bed. The regions there represents regions where at least 2 out of 3 replicates from IL for BWA has been identified as CALLABLE. We also grouped LLU, NCI, and EATRIS data sets into a single "site" as Other. For each aligner, then there are 5 BED files, i.e., IL, NS, FD, NV, and Other.
2) For each aligner, we used the same utility to get the loci where the majority of the loci returned from step 1, i.e., BWA.MajoritySitesCallable.bed was generated from IL.bwa.MajorityCallable.bed, NS.bwa.MajorityCallable.bed, FD.bwa.MajorityCallable.bed, NV.bwa.MajorityCallable.bed, and Other.bwa.MajorityCallable.bed.
3) Consolidate BWA.MajoritySitesCallable.bed, Novo.MajoritySitesCallable.bed, and Bowtie.MajoritySitesCallable.bed into MajorityAlignersCallable.bed.
4) Use Bedtools Merge to merge adjacent regions: mergeBed.MajorityAlignersCallable.bed.
Commands and Scripts
Those three scripts were run successively to create the consensus callable bed files
Result Files
The following are the commands to generate those output BED files following the GATK CallableLoci outputs. Those files were zipped and uploaded to CGC: Consensus_GATK_CallableLoci.zip.
The last file containing information for every data set is MajorityAlignersCallable.bed.gz and mergeBed.MajorityAlignersCallable.bed.
The region in MajorityAlignersCallable.bed consists 2.73 billion bp, or about 88% of the entire human genome. This is the file used to annotate FLAGS=NonCallable in 1st release of Gold Set for variants not in it.
As can be seen from the table below, of the somatic mutation candidates outside those area, over 95% of them are Tier 4 calls or worse (i.e., likely false positives), and only 1.5% calls are classified as Tier 1. Whereas in the callable regions, the fractions are 56.5% and 42.3%, respectively.
Once we have validation data sets, additional confidence-level information will be annotated in additional to the initial tiers, which is crude way to assign confidence-level based on cross-site and cross-aligner reproducibility of calls.
The stats below have mutations in 6p, 16q, chrX, and chrY removed
High-Confidence Regions
To create the "High-Confidence Regions" as a part of the benchmark call set, 20-bp regions spanning LowConf and Unclassified positions were additionally subtracted from the mergeBed.MajorityAlignersCallable.bed described above.
In the latest iteration, it is named High-Confidence_Regions_v1.2.bed here.
GiaB high-confidence regions:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest/README_NISTv3.3.2.txt