howto Generate a bamout file showing how HaplotypeCaller has remapped sequence reads

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2015-04-26

NOTE: Although this document refers to HaplotypeCaller, you can generate the bamout from MuTect2 as well because the tool performs a similar reassembly to HaplotypeCaller.

1. Overview

As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input.

These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.

Please note: The bamout file cannot be generated when using -nt or -nct.

2. Generating the bamout for a single site or interval

To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the -L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the -bamout argument to specify the name of the bamout file that will be generated.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam

If you were using any additional parameters in your original variant calling (including -ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison.

Then you open up both the original bam and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see:

You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).

You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.

3. Generating the bamout for multiple intervals or the whole genome

Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time, pass your list of intervals to -L, or simply omit it if you want the entire genome to be included.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam

This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.

4. Forcing an output in a region that is not covered in the bamout

In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because there is either almost no evidence of variation once the remapping has been done, or on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.

The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags -forceActive and -disableOptimizations to your command line, in addition to the -bamout argument of course.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations

In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone.

It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.

Updated on 2017-05-01

From mscjulia on 2016-07-25

Thank you for the thorough explanation. I’m wondering if this feature is more useful to detect indels than SNVs, and if I could skip it if I’m only insterested in SNVs of WGS please. Thank you.

From Sheila on 2016-07-26

@mscjulia

Hi,

The bamout file is useful for debugging purposes. Usually users use it to determine why a call was made or not made. You don’t need to use it in your regular command line as it increases compute time.

-Sheila

From armarkes on 2016-11-22

Hi @Sheila

I used HaplotypeCaller to call variants (in -ERC mode) and I found a strange output for some variants.

Here is an example oF a g.vcf for one of the samples:

As you can see, these are 3 consecutive positions 73002035, 73002036 and 73002039. I looked to this position in the BAM file and I confirmed that all reads that covered one of these sites also covered the other 2. So it was expected for these 3 positions to have the same DP value. Surprisingly, the position with only the REF allele had DP=196; while the other two positions had DP=8 and DP=15.

This output is really strange and leads to errors in the variant interpretation, because for the position 73002035 we have 196 reads, and it indicates an homozigoty of ALT allele. However in gvcf we got the information of only DP=8 (possible heterozigoty).

Can you explain why this happen?

Thanks,

Ana

Here is the code I used to produce the g.vcf:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I bam —genotyping_mode DISCOVERY —dbsnp dbsnp_138.hg19.vcf -ERC GVCF -L chr10 -L chr14 -L chr15 -o g.vcf

From Sheila on 2016-11-23

@armarkes

Hi Ana,

Can you post an IGV screenshot of the [bamout file](https://software.broadinstitute.org/gatk/documentation/article?id=5484)? Please include 100 bases before and after the site of interest. Are you using the latest version of GATK? Also, it looks like you are using amplicon data. Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/5855/why-is-haplotypecaller-not-calling-this-sanger-sequencing-confirmed-variant) as well.

Thanks,

Sheila

From mmokrejs on 2016-12-07

It would have been helpful if this article stated that I cannot obrain the bamout file when using multiple threads or even data threads.

ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):

...

ERROR MESSAGE: Invalid command line: Argument bamout has a bad value: Currently cannot emit a BAM file from the HaplotypeCaller in multi-threaded mode.

From Sheila on 2016-12-08

@mmokrejs

Hi,

Thanks for the suggestion. I just added a note.

-Sheila

From xiucz on 2018-11-14

Hi,

In the 2nd section，

> The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).

“The two blue reads represent the artificial haplotypes”, do I missed something? I remembered that IGV uses color coding to flag anomalous insert sizes, but here, it stands for a new meaning?

Thank you.

Report abuse