At what point should I merge read group BAM files belonging to the same sample into a single file

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Sheila

on 2015-08-26

It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller.

Well, there are several options for organizing the processing. We have a fairly detailed FAQ article that describes our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs. But in this article we describe at a simpler level what are the main two options depending on how you want to provide the analysis ready BAM files to the variant caller.

To produce a combined per-sample bam file to feed to HaplotypeCaller (most common)

The simplest thing to do is to input all the bam files that belong to that sample, either at the MarkDuplicates step, the Indel Realignment step or at the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment. This is because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all samples together, the bam that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file.

Note: We do not require the PU field in the RG, however, BQSR will consider the PU field over all other fields.

To produce a separate bam file for each read group (less common)

Another option is to keep all the bam files separate until variant calling, and then input them to Haplotype Caller together. You can do this by simply running Indel Realignment and BQSR on each of the bams separately. You can then input all of the bams into HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine will merge the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller does not know nor care whether the data came from many files or one file.

Note: If you input many bam files into Indel Realigner, the default output is one bam file. However, you can output one bam file for each input bam file by using -nWayOut.

Updated on 2016-04-30

From everestial007 on 2015-10-29

Thanks @Sheila : This is very helpful.

From ibseq on 2017-02-13

hi,

im quite confused on how to feed HC with several bam files in the input.

I have a bam file for each of my differerent samples: can I merged all the bams in one file, or what is the other way? do i need to run this command for each of my over 100 samples?:

java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -T HaplotypeCaller \ -I sample1.bam \ —emitRefConfidence GVCF \ [—dbsnp dbSNP.vcf] \ [-L targets.interval_list] \ -o output.raw.snps.indels.g.vcf

thanks,

ibseq

From Geraldine_VdAuwera on 2017-02-13

@ibseq This command must be run individually per-sample (no merging). Then afterward you run GenotypeGVCFs on all the GVCFs together.

From roshabey on 2018-04-18

Hi Sheila or Geraldine_VdAuwera ,

could you elaborate a bit more how you can combine information from multiple bam files for a sample during the BQSR step in GATK4. Just want to know how the syntax works etc…

Wondered whether a more recent page with more information regarding this.

Appreciate it.

Thanks much.

From Geraldine_VdAuwera on 2018-04-18

@roshabey See the pipeline script section starting at https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl#L171, where we scatter the BaseRecalibrator then combine the recalibration tables using [GatherBqsrReports](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_GatherBQSRReports.php).

From roshabey on 2018-04-18

Thanks @Geraldine_VdAuwera . Can you assign 2 or more bam files at the ApplyBQSR stage?

I will read the script too.

From Geraldine_VdAuwera on 2018-04-18

I think that should work but can’t guarantee it — best thing to do is run a small test. Let us know how it goes :)

From CarolineJ on 2018-07-26

Helpful article. Thanks @Sheila !

From Lufpa on 2019-09-03

Hi,

Regarding the second option “To produce a separate bam file for each read group”. I’m wondering if you still advice to do this (the article is from 2016, so I’m not sure)? If MarkDuplicates is run on the individual bams (per lane), and then recalibrated, the recalibrated.bam files fed to HaplotypeCaller are gonna have an underestimated number of duplicates (because the whole sample was never marked as a whole, but in independent runs per lane — see the number shown by Falker in this thread https://gatkforums.broadinstitute.org/gatk/discussion/5861/the-order-of-merge-and-mark-duplicate).

I’m getting re-sequencing data from old samples I have processed already, and going for the second option of treating separate bam files independently until HaplotypeCaller will save me a lot of time. However, after thinking about the underestimation of duplicates, I’m not sure if this is a correct path to take.

Thanks for your help,

Luisa