created by GATK_Team
on 2017-12-29
BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.
Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied to the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who gets tripped up by the similarity of these names.
Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.
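For reference, base quality scores follow the standard Phred convention (this is general sequencing convention, not anything GATK-specific):

```
Q = -10 \log_{10} P_{\text{error}}
```

so Q20 corresponds to an error probability of 10^(-20/10) = 0.01 (99% confidence), and Q30 to one expected error per 1,000 calls.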
Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.
Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.
Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.
This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.
This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.
The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.
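A minimal sketch of these two steps in GATK4 syntax (all file names here are placeholders; substitute your own reference, BAM, and known-sites resources):

```
# Step 1: model covariation between base features and empirical error
gatk BaseRecalibrator \
    -R reference.fasta \
    -I input.bam \
    --known-sites dbsnp.vcf \
    --known-sites mills_and_1000G.indels.vcf \
    -O recal_data.table

# Step 2: rewrite the base quality scores according to the model
gatk ApplyBQSR \
    -R reference.fasta \
    -I input.bam \
    --bqsr-recal-file recal_data.table \
    -O recalibrated.bam
```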
There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
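Continuing the sketch above, the QC pass builds a second model from the recalibrated BAM and plots the two models against each other:

```
# Build a second recalibration model from the already-recalibrated BAM
gatk BaseRecalibrator \
    -R reference.fasta \
    -I recalibrated.bam \
    --known-sites dbsnp.vcf \
    --known-sites mills_and_1000G.indels.vcf \
    -O post_recal_data.table

# Generate before/after plots (requires R with the gsalib and ggplot2 packages)
gatk AnalyzeCovariates \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf
```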
To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases: the read group the read belongs to, the quality score reported by the machine, the machine cycle that produced the base (i.e. its position in the read), and the local sequence context (the current base plus the bases preceding it).
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.
Note that the recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This smoothing deals very nicely with overfitting problems; it has only a minor impact on data sets with billions of bases, but it is critical to avoid overconfidence in rare bins in sparse data.
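In Phred terms, the empirical quality of a bin with n observed bases and e mismatches is computed from this smoothed error rate (just the formula above restated in LaTeX, plus the standard Phred conversion):

```
\hat{p} = \frac{e + 1}{n + 2}, \qquad Q_{\text{empirical}} = -10 \log_{10} \hat{p}
```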
This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is the sum of the global difference between the reported quality scores and the empirical quality, plus the shift specific to the base's reported quality bin, plus the cycle and sequence context effects for that base.
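Schematically, as a sketch of the additive model just described (the notation is ours, not taken from the GATK source):

```
Q_{\text{recal}} = Q_{\text{reported}} + \Delta Q_{\text{global}} + \Delta Q_{\text{bin}} + \Delta Q_{\text{cycle}} + \Delta Q_{\text{context}}
```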
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
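Read groups are identified by the @RG lines in the BAM header; a typical entry looks something like the following (fields are purely illustrative), where LB identifies the library and PU encodes the flowcell and lane:

```
@RG	ID:lane2.lib1	SM:sample1	LB:lib1	PU:HJYFJCCXX.2	PL:ILLUMINA
```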
A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.
You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.
Here's how you would bootstrap a set of known variants: first, do an initial round of variant calling on your original, unrecalibrated data; then take the variants that you have the highest confidence in and use that set as your database of known variants; finally, do a real round of variant calling with the recalibrated data. These steps can be repeated several times until convergence. A minimal sketch of one iteration follows.
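A sketch of one bootstrap iteration in GATK4 syntax (file names are placeholders, and the hard-filtering expression is only an example that you would tune for your own data):

```
# 1. Initial round of variant calling on the unrecalibrated BAM
gatk HaplotypeCaller -R reference.fasta -I input.bam -O raw.vcf

# 2. Keep only high-confidence calls to stand in for a known-sites resource
gatk VariantFiltration -R reference.fasta -V raw.vcf \
    --filter-expression "QD < 2.0 || FS > 60.0" --filter-name lowconf \
    -O flagged.vcf
gatk SelectVariants -R reference.fasta -V flagged.vcf \
    --exclude-filtered -O confident.vcf

# 3. Recalibrate against the bootstrapped sites, then call for real
gatk BaseRecalibrator -R reference.fasta -I input.bam \
    --known-sites confident.vcf -O recal.table
gatk ApplyBQSR -R reference.fasta -I input.bam \
    --bqsr-recal-file recal.table -O recalibrated.bam
gatk HaplotypeCaller -R reference.fasta -I recalibrated.bam -O bootstrap1.vcf
```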
The main case where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or when you're working with a really weird organism that displays insane amounts of variation.
This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.
Note: the scales for the number of bases in the two graphs are different.
The recalibration report contains the following 5 tables:
Arguments Table
This is the table that contains all the arguments used to run BQSR for this dataset.
````
Argument               Value
covariate              null
defaultplatform        null
deletionscontextsize   6
forceplatform          null
insertionscontextsize  6
...
````
Quantization Table
The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system, minimizing the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is tabulated and a 'default' quantization table is generated. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities; with -qq N you recalculate the quantization bins using N bins.
````
QualityScore  Count     QuantizedScore
0             252       0
1             15972     1
2             553525    2
3             2190142   9
4             5369681   9
9             83645762  9
...
````
ReadGroup Table
This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions.
````
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D          40.7476           45.0000             2642683174    222475
SRR032766  D          40.9072           45.0000             2630282426    213441
SRR032764  D          40.5931           45.0000             2919572148    254687
SRR032769  D          40.7448           45.0000             2850110574    240094
SRR032767  D          40.6820           45.0000             2820040026    241020
SRR032765  D          40.9034           45.0000             2441035052    198258
SRR032766  M          23.2573           23.7733             2630282426    12424434
SRR032768  M          23.0281           23.5366             2642683174    13159514
SRR032769  M          23.2608           23.6920             2850110574    13451898
SRR032764  M          23.2302           23.6039             2919572148    13877177
SRR032765  M          23.0271           23.5527             2441035052    12158144
SRR032767  M          23.1195           23.5852             2820040026    13750197
SRR032766  I          41.7198           45.0000             2630282426    177017
SRR032768  I          41.5682           45.0000             2642683174    184172
SRR032769  I          41.5828           45.0000             2850110574    197959
SRR032764  I          41.2958           45.0000             2919572148    216637
SRR032765  I          41.5546           45.0000             2441035052    170651
SRR032767  I          41.5192           45.0000             2820040026    198762
````
Quality Score Table
This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions.
````
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767  49            M          33.7794           9549          3
SRR032769  49            M          36.9975           5008          0
SRR032764  49            M          39.2490           8411          0
SRR032766  18            M          17.7397           16330200      274803
SRR032768  18            M          17.7922           17707920      294405
SRR032764  45            I          41.2958           2919572148    216637
SRR032765  6             M          6.0600            3401801       842765
SRR032769  45            I          41.5828           2850110574    197959
SRR032764  6             M          6.0751            4220451       1041946
SRR032767  45            I          41.5192           2820040026    198762
SRR032769  6             M          6.3481            5045533       1169748
SRR032768  16            M          15.7681           12427549      329283
SRR032766  16            M          15.8173           11799056      309110
SRR032764  16            M          15.9033           13017244      334343
SRR032769  16            M          15.8042           13817386      363078
...
````
Covariates Table
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.
````
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767  16            TACGGA          Context        M          14.2139           817           30
SRR032766  16            AACGGA          Context        M          14.9938           1420          44
SRR032765  16            TACGGA          Context        M          15.5145           711           19
SRR032768  16            AACGGA          Context        M          15.0133           1585          49
SRR032764  16            TACGGA          Context        M          14.5393           710           24
SRR032766  16            GACGGA          Context        M          17.9746           1379          21
SRR032768  45            CACCTC          Context        I          40.7907           575849        47
SRR032764  45            TACCTC          Context        I          43.8286           507088        20
SRR032769  45            TACGGC          Context        D          38.7536           37525         4
SRR032768  45            GACCTC          Context        I          46.0724           445275        10
SRR032766  45            CACCTC          Context        I          41.0696           575664        44
SRR032769  45            TACCTC          Context        I          43.4821           490491        21
SRR032766  45            CACGGC          Context        D          45.1471           65424         1
SRR032768  45            GACGGC          Context        D          45.3980           34657         0
SRR032767  45            TACGGC          Context        D          42.7663           37814         1
SRR032767  16            AACGGA          Context        M          15.9371           1647          41
SRR032764  16            GACGGA          Context        M          18.2642           1273          18
SRR032769  16            CACGGA          Context        M          13.0801           1442          70
SRR032765  16            GACGGA          Context        M          15.9934           1271          31
...
````
Updated on 2018-09-18
From amywilliams on 2018-01-17
Is there a newly updated page that documents best practices for humans and references the files from the resource bundle? I ran this a couple of years ago and can use the same files (now with the latest versions) I did then: dbsnp, Mills_and_1000G_gold_standard, and 1000G_phase1. But I now see there are several other potentially relevant resources in the bundle. Is it suggested to use Axiom_Exome_Plus for whole genome sequence data? What about hapmap_3.3 or hapmap_3.3_grch38_pop_stratified or 1000G_omni2.5?
And is it now suggested that we only apply this to the intervals included in wgs_calling_regions.hg38.interval_list?
Thank you!
From Geraldine_VdAuwera on 2018-01-19
Hi @amywilliams, the definitive recommendations for exactly how to run this are now given in the form of reference implementations, which you can find linked here: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165
We still have some work to do to make this completely user-friendly, but this should at least get you started.
From stochasticsdrj on 2018-01-29
Dear GATK team,
Thank you for all your effort and excellent update with v4.
I am having an issue defining covariates e.g. previous command line code would have looked like this:
```
java -Xmx4G -jar CommandLineGATK_deploy.jar -T BaseRecalibrator \
    -R REF.genome.fa \
    -I sorted.deduped.merged.realigned.bam \
    -knownSites dbsnp_REF.vcf \
    -o base_recalibration.table \
    -nct 32 --useOriginalQualities --disable_indel_quals \
    -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate
```
How do I correctly define covariates in v4? I have combed through the "-h" output and feel like I am missing something obvious. I have appropriately changed "--known-sites" and "--use-original-qualities true", but are "--disable_indel_quals" and "-nct" deprecated?
```
java -Xmx4G -jar CommandLineGATK_deploy.jar -T BaseRecalibrator \
    -R hg38.genome.fa \
    -I sorted.deduped.merged.realigned.bam \
    --known-sites dbsnp_146.hg38.vcf \
    -O base_recalibration.table \
    --use-original-qualities true
```
Appreciate any help you can give please.
From berutti on 2018-02-06
I would like to add to this question: I cannot find a replacement for --disable_indel_quals when applying recalibration in GATK4. This creates huge files, especially with genomes, because of these BI/BD tags. Since apparently most of the options changed their names between 3 and 4, does this feature still exist in GATK4, and if so, what is its new name? Thanks for your help!
From esrabytk on 2018-02-10
Hi,
I performed BQSR using gatk-3.8 and applied it to my BAM files without any problem. But when I switched to gatk-4.0.1.2, it could not create the .table files, although I used the same VCF files and reference downloaded from the GATK bundle. Here are my code and error message:
```
gatk BaseRecalibrator \
-R library/human_g1k_v37_decoy.fasta \
-I CF0098_redup.bam \
--known-sites library/dbsnp_138.b37.vcf \
--known-sites library/1000G_phase1.indels.b37.vcf \
--known-sites library/Mills_and_1000G_gold_standard.indels.b37.vcf \
-O GATK/GATK4_CF0098_recal_data.table
```
```
[February 10, 2018 1:41:22 PM EET] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=387973120
org.broadinstitute.hellbender.exceptions.GATKException: Error querying file /media/esra/Kucuk Lab/esra-data/library/dbsnp_138.b37.vcf over interval 1:10295-10392
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:548)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.queryAndPrefetch(FeatureDataSource.java:513)
    at org.broadinstitute.hellbender.engine.FeatureManager.getFeatures(FeatureManager.java:308)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:163)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:115)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:253)
    at org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator.apply(BaseRecalibrator.java:185)
    at org.broadinstitute.hellbender.engine.ReadWalker.lambda$traverse$0(ReadWalker.java:96)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.io.FileNotFoundException: /media/esra/Kucuk%20Lab/esra-data/library/dbsnp_138.b37.vcf (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.(RandomAccessFile.java:243)
    at htsjdk.samtools.seekablestream.SeekableFileStream.(SeekableFileStream.java:47)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:99)
    at htsjdk.tribble.TribbleIndexedFeatureReader.getSeekableStream(TribbleIndexedFeatureReader.java:188)
    at htsjdk.tribble.TribbleIndexedFeatureReader.access$000(TribbleIndexedFeatureReader.java:56)
    at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.(TribbleIndexedFeatureReader.java:422)
    at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:296)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:544)
    ... 27 more
```
Thanks,
From Sheila on 2018-02-12
@stochasticsdrj
Hi,
Sorry for the delay. I did not have emails come through when someone posted in a discussion, but now I do :smile:
The covariates argument was removed to get rid of some bad code. You can read more about it [here](https://github.com/broadinstitute/gatk/issues/185). I think there was some talk of possibly having it re-enabled, but I am not sure if there is a plan to actually do so.
-Sheila
From Sheila on 2018-02-12
@berutti
Hi,
--disable_indel_quals is the default now :smiley: Have a look [here](https://github.com/broadinstitute/gatk/issues/1056) for more information.
-Sheila
From Sheila on 2018-02-12
@esrabytk
Hi,
This part of the log output:
`Caused by: java.io.FileNotFoundException: /media/esra/Kucuk%20Lab/esra-data/library/dbsnp_138.b37.vcf (No such file or directory)`
tells you that the tool cannot find the VCF file in the directory you specified.
-Sheila
From esrabytk on 2018-02-13
Hi @Sheila
I saw the same line in the error message and checked the file before writing to you. The problem is that the dbSNP file is present at that path, and gatk-3.8.0 can find it. I downloaded the dbSNP file from the GATK bundle again, but the result did not change. I ran both versions concurrently:
gatk-3.8.0
```
java -jar /home/esra/Documents/prgram_esra/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R library/human_g1k_v37_decoy.fasta \
    -I GATK/IndelRealignment/CF0098_indelrealigner.bam \
    -L library/targets.interval_list \
    -knownSites library/dbsnp_138.b37.vcf \
    -knownSites library/1000G_phase1.indels.b37.vcf \
    -knownSites library/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -o GATK3_CF0098_recal_data.table
```
```
INFO 10:25:26,673 BaseRecalibrator – …done!
INFO 10:25:26,673 BaseRecalibrator – BaseRecalibrator was able to recalibrate 5845382 reads
INFO 10:25:26,673 ProgressMeter – done 5845382.0 6.9 m 70.0 s 100.0% 6.9 m 0.0 s
INFO 10:25:26,674 ProgressMeter – Total runtime 412.31 secs, 6.87 min, 0.11 hours
INFO 10:25:26,674 MicroScheduler – 227653 reads were filtered out during the traversal out of approximately 6073035 total reads (3.75%)
INFO 10:25:26,674 MicroScheduler – -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 10:25:26,674 MicroScheduler – -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 10:25:26,674 MicroScheduler – -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:25:26,674 MicroScheduler – -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:25:26,675 MicroScheduler – -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:25:26,675 MicroScheduler – -> 227653 reads (3.75% of total) failing MappingQualityZeroFilter
INFO 10:25:26,675 MicroScheduler – -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 10:25:26,675 MicroScheduler – -> 0 reads (0.00% of total) failing UnmappedReadFilter
—————————————————————————————————————————————
Done. There were no warn messages.
—————————————————————————————————————————————
```
gatk-4.0.1.2
```
gatk BaseRecalibrator \
-R library/human_g1k_v37_decoy.fasta \
-I CF0098_redup.bam \
--known-sites library/dbsnp_138.b37.vcf \
--known-sites library/1000G_phase1.indels.b37.vcf \
--known-sites library/Mills_and_1000G_gold_standard.indels.b37.vcf \
-O GATK4_CF0098_recal_data.table
```
```
10:19:49.062 INFO ProgressMeter – Starting traversal
10:19:49.062 INFO ProgressMeter – Current Locus Elapsed Minutes Reads Processed Reads/Minute
10:19:49.094 INFO BaseRecalibrator – Shutting down engine
[February 13, 2018 10:19:49 AM EET] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=301989888
org.broadinstitute.hellbender.exceptions.GATKException: Error querying file /media/esra/Kucuk Lab/esra-data/library/dbsnp_138.b37.vcf over interval 1:10295-10392
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:548)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.queryAndPrefetch(FeatureDataSource.java:513)
    at org.broadinstitute.hellbender.engine.FeatureManager.getFeatures(FeatureManager.java:308)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:163)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:115)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:253)
    at org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator.apply(BaseRecalibrator.java:185)
    at org.broadinstitute.hellbender.engine.ReadWalker.lambda$traverse$0(ReadWalker.java:96)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.io.FileNotFoundException: /media/esra/Kucuk%20Lab/esra-data/library/dbsnp_138.b37.vcf (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.(RandomAccessFile.java:243)
    at htsjdk.samtools.seekablestream.SeekableFileStream.(SeekableFileStream.java:47)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:99)
    at htsjdk.tribble.TribbleIndexedFeatureReader.getSeekableStream(TribbleIndexedFeatureReader.java:188)
    at htsjdk.tribble.TribbleIndexedFeatureReader.access$000(TribbleIndexedFeatureReader.java:56)
    at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.(TribbleIndexedFeatureReader.java:422)
    at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:296)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:544)
```
I deleted the known-sites dbSNP line from my command (gatk-4.0.1.2) and ran it again to see whether it would work; this time I got the same error message for the indel files, even though they are also present at the right path.
Thanks
From Sheila on 2018-02-14
@esrabytk
Hi,
Can you submit a bug report? Instructions are [here](https://software.broadinstitute.org/gatk/documentation/article?id=11021).
Thanks,
Sheila
From esrabytk on 2018-02-14
Hi @Sheila
I submitted the bug report according to the instructions. I would like to note that I downloaded all reference files from the GATK bundle, and therefore I did not upload them.
Folder name: GATK-4.0.1.2_cannot find reference file.tar.gz
Thank you so much for your help :)
From SkyWarrior on 2018-02-20
Could you try giving the full path of the file for each vcf? It is possible that there is a difference between 4.x and 3.x regarding the way they handle absolute paths.
From esrabytk on 2018-02-20
Hi @SkyWarrior
Thanks for the suggestion, it works! :)
From Sheila on 2018-02-20
@esrabytk
@SkyWarrior
Hi,
Thanks for letting me know. I did not realize this could be the issue. Glad it works and I don’t have to submit a bug report :smiley:
-Sheila
From rourich on 2018-05-04
Hi,
I am using BaseRecalibrator for the Germline short variant discovery (SNPs + Indels) workflow. However, I don't know which known sites I should use.
Since I have mapped my reads against GRCh38 from [here](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/), I understand that I should download the known-sites files from [this bundle](ftp://ftp.broadinstitute.org/bundle/hg38/). However, [this article](https://software.broadinstitute.org/gatk/documentation/article.php?id=1247) recommends the use of “1KG indels”, which in fact is not stored in the hg38 bundle (dbSNP and “Mills indels” are stored there).
I realize that the “1KG indels” file is stored in the [b37 bundle](ftp://ftp.broadinstitute.org/bundle/b37/). However, b37 is another reference…
Could you confirm that using only dbSNP and “Mills indels” as known sites for BaseRecalibrator with hg38 is correct, and that I don't have to use “1KG indels” to keep false positives out of my results?
Thank you very much, best regards
From Sheila on 2018-05-09
@rourich
Hi,
For hg38, it looks like Mills and 1000Genomes are included in the same file (Mills_and_1000G_gold_standard.indels.hg38.vcf.gz). Sorry for the confusion.
-Sheila
From ekanterakis on 2018-05-23
Hello, is there a tool that generates the beautiful before and after BQSR plots? I had a look at `AnalyzeCovariates` (GATK v4.0.1.1) but couldn’t figure out how to do before vs after. That report seems to be plotting the before only? (judging by the qscore bins).
I’m running using `gatk AnalyzeCovariates -bqsr recal_data.csv -plots recal_data.pdf`
where `recal_data.csv` is the output of `BaseRecalibrator`. Thanks a lot!
From ekanterakis on 2018-05-29
OK! Apparently you need to run BQSR twice, once on the original bam and again on the BQSR’ed bam and then run `AnalyzeCovariates -before original.recal_data.csv -after bqsr.recal_data.csv -plots before_vs_after.pdf`. Thanks!
From niuguohao on 2018-06-27
Hello, how can I generate the plots in **Examples of pre- and post-recalibration metrics**?
By using AnalyzeCovariates?
Thanks!
From Sheila on 2018-06-28
@niuguohao
Hi,
Yes, that is correct.
-Sheila
From MasMarius on 2018-07-27
Hi,
I've got a problem now: after sorting and marking duplicates in my reads, launching BQSR on them always ends with the same output message and nothing as a result.
attached stdout.txt
Any suggestion?
Thanks in advance!
Regards.
From MasMarius on 2018-07-31
Hi,
I finally solved it. I don't know why, but changing the data used as known-sites and the .FAI and .DICT files fixed it, and it now runs perfectly.
From kmmahan on 2018-08-08
I'm not working with the human genome; I'm working with algae. I have done all the preprocessing steps except base recalibration. I'm confused about which “known sites” to use, since I can't use dbSNP or anything else with a list of known variants. Maybe I don't need to recalibrate, since there are no databases available with algal variants? Can I just go on to HaplotypeCaller?
From Sheila on 2018-08-13
@kmmahan
Hi,
Have a look under “No excuses” in this article.
-Sheila
From kmmahan on 2018-08-13
How do you take the variants with the highest confidence to make the database of known variants? Just by hard filtering? Which quality scores/parameters should be used as cutoffs?
From kmmahan on 2018-08-13
Would this also be used as the “resource dataset” for VQSR?
From Sheila on 2018-08-15
@kmmahan
Hi,
I think [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/4562/questions-about-filtering-variants-manually) and [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/22291) will help with your first question. Also, search the forum for “bqsr hard filtering” and you may get some more helpful threads :smile:
I think [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/25693#Comment_25693) will help with your second question.
-Sheila
From mattwell on 2018-08-28
Hi,
I am trying to generate the 2ed recalibration table for AnalyzeCovariates, but the augment -bqsr required to input the 1st recalibration table in BaseRecalibrator does not seem to be present in the tools documentation and I am getting a user error for that augment.
Can I simply run BaseRecalibrator on the recalibrated bam from the ApplyBQSR output and use this recaliabration table for before and after plots in AnalyzeCovariates?
I am using gatk-4.0.3.0.
From Laurent_MeneSaffrane on 2018-08-31
August 31, 2018.
Dear GATK experts,
I am contacting you regarding the BQSR process that I ran on my files with both the GATK3 and GATK4 pipelines on the same data set.
GATK3 pipeline:
fastq.gz —> bwa mem —> MarkDuplicates —> bam ready for BQSR
HC on the original file (no snp_db) —> variants 1 —> SelectVariants —> SNP 1 + Indel 1 —> VariantFiltration —> SNP 2 + Indel 2
BQSR with SNP 2 + Indel 2 —> recal_table_1 (before recalibration of the bam file)
BQSR with SNP 2 + Indel 2 + recal_table_1 (-bqsr argument; recal on-the-fly) —> recal_table_2 (after recalibration)
AnalyzeCovariates -before recal_table_1 -after recal_table_2 —> Recalibration plots 1 (GATK3 pipeline)
GATK4 pipeline:
bwa mem —> MarkDuplicates —> bam ready for BQSR
HC on the original bam file (no snp_db) —> variants 1 —> SelectVariants —> SNP 1 + Indel 1 —> VariantFiltration —> SNP 2 + Indel 2
BQSR with SNP 2 + Indel 2 —> recal_table_1 (before recalibration of the bam file)
ApplyBQSR to the original bam file with recal_table_1 —> recalibrated bam file
HC on the recalibrated bam file —> variants 2 —> SelectVariants —> SNP 3 + Indel 3 —> VariantFiltration —> SNP 4 + Indel 4
BQSR with SNP 4 + Indel 4 —> recal_table_2 (after recalibration of the bam file; no recalibration on-the-fly with GATK4)
AnalyzeCovariates -before recal_table_1 -after recal_table_2 —> Recalibration plots 1 (GATK4 pipeline)
The GATK3 pipeline seems to produce a significant recalibration of the original data, whereas the GATK4 pipeline does not seem to affect the original data much. Do you know why I observe such a difference between the two recalibration pipelines? Alternatively, why does the GATK4 pipeline barely recalibrate base quality scores compared to the GATK3 one?
Many thanks to all of you for your help and support.
Best, Laurent
Recalibration_plots_GATK4.pdf
From Sheila on 2018-09-05
@mattwell
Hi,
Have a look at Soo Hee’s answer [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/43986#Comment_43986).
-Sheila
From Sheila on 2018-09-06
@Laurent_MeneSaffrane
Hi Laurent,
I know there were some changes made to BQSR tools in GATK4, but I am not sure if those are related to your differences. Can you tell us if you get the exact same variants from HaplotypeCaller with each of the recalibrated BAM files?
Thanks,
Sheila
From Angry_Panda on 2019-03-09
Hi,
I read that BQSR uses a machine learning method, but found no more detailed information. Do we have some reference material or a paper for this?
I searched and found some papers that discuss VQSR, but not BQSR.
Best.
From lidia on 2019-07-31
Hi GATK team!
I am working on a fish for which I don't have a database of known SNPs or indels. I am working with a total of 394 individuals. This means that I have 394 WGS samples, and I would like to use the GVCF workflow.
According to what I have read, I need to create such lists with my own data. However, I have a couple of questions regarding the pipeline to achieve this.
1) In order to generate my list of SNPs and indels that will be provided as input for Base Quality Score Recalibration, should I use HaplotypeCaller in normal mode? Or should I use the GVCF mode in this first round of HaplotypeCaller?
2) Since calling the variants will be done per sample, at the end I would have a total of 394 output files.
Should I combine them all together and keep only the high-quality variants, so that at the end I have only one file of SNPs and one of indels to use for all 394 samples? Or should each sample be recalibrated with its own set of SNPs and indels?
Many thanks to all of you for your help and support.
Lidia
From wbsimey on 2019-07-31
We work with many non-model vertebrates and we are trying to establish a workflow that includes the generation of known-sites lists for BQSR analyses. We have been reading the hundreds of forum threads and documentation, but we still do not have a clear picture. I have made a rough diagram to help explain what we have so far and where we are stuck.
There are three places where we are unsure:
1. For GenotypeGVCFs, we replaced the SelectVariants tool (select-type SNP and Indel) with the GenotypeGVCFs tool, but we are not sure, based on the documentation, whether this selects SNPs and indels.
2. Should we use the --excludeFiltered option to generate the “known-sites” tables, or leave it for the last step as shown in our graphic?
3. For the ApplyBQSR step, the documentation indicates that the output file is a BAM file. It is unclear to us how to use this file. Our understanding is that the next step is to rerun HaplotypeCaller with the BQSR option, but that is no longer a GATK4 option. This is currently our biggest sticking point; we cannot figure out what to do here. How does this BAM file feed into HaplotypeCaller, and how do we tell it that we have BQSR information?
4. Does this workflow look correct?

Once we get this non-model organism workflow working and optimized, we will make it publicly available. I can tell from the many threads about this issue that it would be very useful.
From wbsimey on 2019-08-04
In my diagram above, I have a single analysis for BaseRecalibrator and ApplyBQSR on the merged data. I think this might be incorrect. Do I need to rerun every BAM sample through these two BQSR steps? For example, if I have 100 low-coverage genomes, each with a BAM file generated by the bwa mem steps, do I have to rerun all 100 BAM files through BaseRecalibrator and ApplyBQSR and then feed each new BAM file generated by ApplyBQSR back through HaplotypeCaller?
Should it look more like this?
