How to Fix a badly formatted BAM

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2013-07-03

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. The options on this page are listed in order of decreasing complexity.

You may ask, is all of this really necessary? The GATK imposes strict formatting guidelines, including requiring certain read group information, that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

Prerequisites

    • Installed Picard tools
    • If indexing or marking duplicates, then coordinate sorted reads
    • If coordinate sorting, then reference aligned reads
    • For each read group ID, a single BAM file. If you have a multiplexed file, separate to individual files per sample.

Jump to a section on this page

    1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups
    2. Coordinate sort and index using SortSam
    3. Index an already coordinate-sorted BAM using BuildBamIndex
    4. Mark duplicates using MarkDuplicates

Tools involved

Related resources

    • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure that your read group fields conform to the recommendations.
    • Picard's standard options includes adding CREATE_INDEX to the commands of any of its tools that produce coordinate sorted BAMs.

1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups

Use Picard's AddOrReplaceReadGroups to appropriately label read group (@RG) fields, coordinate sort and index a BAM file. Only the five required @RG fields are included in the command shown. Consider the other optional @RG fields for better record keeping.

java -jar picard.jar AddOrReplaceReadGroups \ INPUT=reads.bam \ OUTPUT=reads_addRG.bam \ RGID=H0164.2 \ #be sure to change from default of 1 RGLB= library1 \ RGPL=illumina \ RGPU=H0164ALXX140820.2 \ RGSM=sample1 \

This creates a file called reads_addRG.bam with the same content and sorting as the input file, except the SAM record header's @RG line will be updated with the new information for the specified fields and each read will now have an RG tag filled with the @RG ID field information. Because of this repetition, the length of the @RG ID field contributes to file size.

To additionally coordinate sort by genomic location and create a .bai index, add the following options to the command.

SORT_ORDER=coordinate \ CREATE_INDEX=true

To process large files, also designate a temporary directory.

TMP_DIR=/path/shlee #sets environmental variable for temporary directory

2. Coordinate sort and index using SortSam

Picard's SortSam both sorts and indexes and converts between SAM and BAM formats. For coordinate sorting, reads must be aligned to a reference genome.

java -jar picard.jar SortSam \ INPUT=reads.bam \ OUTPUT=reads_sorted.bam \ SORT_ORDER=coordinate \

Concurrently index by tacking on the following option.

CREATE_INDEX=true

This creates a file called reads_sorted.bam containing reads sorted by genomic location, aka coordinate, and a .bai index file with the same prefix as the output, e.g. reads_sorted.bai, within the same directory.

To process large files, also designate a temporary directory.

TMP_DIR=/path/shlee #sets environmental variable for temporary directory

3. Index an already coordinate-sorted BAM using BuildBamIndex

Picard's BuildBamIndex allows you to index a BAM that is already coordinate sorted.

java -jar picard.jar BuildBamIndex \ INPUT=reads_sorted.bam

This creates a .bai index file with the same prefix as the input file, e.g. reads_sorted.bai, within the same directory as the input file. You want to keep this default behavior as many tools require the same prefix and directory location for the pair of files. Note that Picard tools do not systematically create an index file when they output a new BAM file, whereas GATK tools will always output indexed files.

To process large files, also designate a temporary directory.

TMP_DIR=/path/shlee #sets environmental variable for temporary directory

4. Mark duplicates using MarkDuplicates

Picard's MarkDuplicates flags both PCR and optical duplicate reads with a 1024 (0x400) SAM flag. The input BAM must be coordinate sorted.

java -jar picard.jar MarkDuplicates \ INPUT=reads_sorted.bam \ OUTPUT=reads_markdup.bam \ METRICS_FILE=metrics.txt \ CREATE_INDEX=true

This creates a file called reads_markdup.bam with duplicate reads marked. It also creates a file called metrics.txt containing duplicate read data metrics and a .bai index file.

To process large files, also designate a temporary directory.

TMP_DIR=/path/shlee #sets environmental variable for temporary directory

    • During sequencing, which involves PCR amplification within the sequencing machine, by a stochastic process we end up sequencing a proportion of DNA molecules that arose from the same parent insert. To be stringent in our variant discovery, GATK tools discount the duplicate reads as evidence for or against a putative variant.
    • Marking duplicates is less relevant to targeted amplicon sequencing and RNA-Seq analysis.
    • Optical duplicates arise from a read being sequenced twice as neighboring clusters.

back to top

Updated on 2015-12-04

From Heidihuang on 2013-09-30

If I want to add more header information for example, HD, VN, GO,SO, SQ,SN,LN,RG,ID,LB,SM,CL how to do that

From Geraldine_VdAuwera on 2013-09-30

To add or modify read group information, use Picard’s AddOrReplaceReadGroup. For other header information you’ll have to look at other tools in Picard or samtools. GATK does not provide functions to do this.

From roger_d on 2015-01-24

This was very helpful to me, but I needed to make a few changes to get the examples to work:

In 2) the required input METRICS_FILE should be added, like this:

java -jar MarkDuplicates.jar \ INPUT=sorted_reads.bam \ OUTPUT=dedup_reads.bam \ METRICS_FILE=sorted_reads.dup

in 4) BuildBamIndex should be BuildBamIndex.jar

From Geraldine_VdAuwera on 2015-01-26

Thanks for pointing this out, @roger_d. I’ve made the necessary corrections.

Note for others: if you are using a very recent version of Picard, you may need to adapt the command lines to the new Picard syntax, which uses one monolithic jar to which you pass tool names (e.g. `-jar picard.jar BuildBamIndex` etc.), like we have in GATK.

From shlee on 2015-11-25

I’ve recently revamped this page and it is pending review.

From Amanda on 2016-02-18

Has there been a change in the options for adding read groups? I am getting an error where it says it does not recognize the option RGID. Thank you!

From Geraldine_VdAuwera on 2016-02-18

I can’t think of anything major like that. Can you please post your full command and console output? And version too.

From Amanda on 2016-02-18

Hi Geraldine, Please disregard, I found a typo that had been integrated somehow into a saved script that was causing the error. Thank you!

From MUHAMMADSOHAILRAZA on 2016-07-26

@Geraldine_VdAuwera

Hi,

I have got some merged BAMs with RG information looks like this:

RG ID:LP6005619-DNA_A01.L2 LB:LP6005619-DNA-PE PL:HiSeq2000 SM:LP6005619-DNA_A01 PU:HiSeq2000 CN:Illumina-ltd

RG ID:LP6005619-DNA_A01.L3 LB:LP6005619-DNA-PE PL:HiSeq2000 SM:LP6005619-DNA_A01 PU:HiSeq2000 CN:Illumina-ltd

@RG ID:LP6005619-DNA_A01.L4 LB:LP6005619-DNA-PE PL:HiSeq2000 SM:LP6005619-DNA_A01 PU:HiSeq2000 CN:Illumina-ltd

Here we have

SAME ‘PU’ and ‘PL’ values.

UNIQUE ‘ID’ values.

Since read groups from each are assigned with unique RGIDs, Would this info impact while Marking duplicates in PICARDS and/or BQSR in GATK?

Thanks!

From Sheila on 2016-08-01

@MUHAMMADSOHAILRAZA

Hi,

MarkDuplicates takes into account library information. So, because all of your reads come from the same library, they will be processed together.

As for BQSR, it takes into account PU over all other fields. So, all your reads will be processed togetehr in BQSR. However, it looks like your reads come from different lanes, so you should reflect that in your PU field.

-Sheila

P.S. Your PL should be ILLUMINA (we don’t distinguish machine types).

From MUHAMMADSOHAILRAZA on 2016-08-02

@Sheila

Thanks..

From Twesidave on 2019-05-06

If I already have a bam.bai file, do I need to index my bam file again after reformatting with AddReadGroups and MarkDuplicates?