Read groups

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by GATK_Team

on 2017-12-24

There is no formal definition of what a read group is, but in practice the term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, each subset of reads originating from a separate library run on that lane constitutes a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '^@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
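As a rough illustration of how the @RG line above breaks down into tags, here is a minimal Python sketch that parses one header line and reports which fields are missing. It assumes the line is tab-separated, as in a real SAM header (the example above is shown with spaces). The function names and the `REQUIRED_FIELDS` tuple are hypothetical; the tuple is only a reading of this article's field list, not something taken from GATK itself.

```python
# Fields treated as required in this sketch. This is an assumption based on the
# article's field descriptions (PU is described as optional), not GATK code.
REQUIRED_FIELDS = ("ID", "SM", "PL", "LB")


def parse_rg_line(line):
    """Parse one tab-separated @RG header line into a dict of tag -> value."""
    parts = line.rstrip("\n").split("\t")
    if parts[0] != "@RG":
        raise ValueError("not an @RG header line")
    fields = {}
    for part in parts[1:]:
        # Split on the first colon only: values such as DT contain colons.
        tag, _, value = part.partition(":")
        fields[tag] = value
    return fields


def missing_fields(fields, required=REQUIRED_FIELDS):
    """Return the required tags that are absent from a parsed @RG line."""
    return [tag for tag in required if tag not in fields]


# The example line from this article, rebuilt with tab separators.
example_line = "\t".join([
    "@RG", "ID:H0164.2", "PL:illumina", "PU:H0164ALXX140820.2",
    "LB:Solexa-272222", "PI:0", "DT:2014-08-20T00:00:00-0400",
    "SM:NA12878", "CN:BI",
])

parsed = parse_rg_line(example_line)
print(parsed["ID"], parsed["SM"], missing_fields(parsed))
```

In practice the input lines would come from the `samtools view -H` command shown earlier; the sketch only handles the string parsing.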

Meaning of the read group fields required by GATK

ID = Read group identifier
This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell name and lane number, making them a globally unique identifier across all sequencing data in the world.
Use for BQSR: ID is the lowest common denominator that differentiates factors contributing to technical batch effects. A read group is therefore effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration (unless PU is defined), since all reads in a read group are assumed to share the same error model.

PU = Platform Unit
The PU holds three types of information: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK, it takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate the flow cell lane, marked by .2, a factor that contributes to batch effects.

SM = Sample
The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

PL = Platform/technology used to produce the read
This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

LB = DNA preparation library identifier
MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.
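The two rules above (PU takes precedence over ID for base recalibration, and LB decides which read groups may share molecular duplicates) can be sketched in a few lines of Python. This only illustrates the rules as this article states them; the function names are hypothetical and this is not how GATK or Picard implement them internally.

```python
def recalibration_unit(read_group):
    """Return PU when it is present and non-empty, otherwise fall back to ID."""
    return read_group.get("PU") or read_group["ID"]


def may_share_duplicates(rg_a, rg_b):
    """Read groups can contain duplicates of the same molecules only if
    they were sequenced from the same library (same LB value)."""
    return rg_a["LB"] == rg_b["LB"]


# Two lanes of the same library, using values from this article's examples.
lane1 = {"ID": "FLOWCELL1.LANE1", "LB": "LIB-DAD-1", "SM": "DAD"}
lane2 = {"ID": "FLOWCELL1.LANE2", "LB": "LIB-DAD-1", "SM": "DAD"}
with_pu = {"ID": "H0164.2", "PU": "H0164ALXX140820.2", "LB": "Solexa-272222"}

print(recalibration_unit(lane1))           # no PU, so ID is used
print(recalibration_unit(with_pu))         # PU is used
print(may_share_duplicates(lane1, lane2))  # same library
```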

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.

Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portions of the read names, which come after the flow cell lane and are separated by colons, are the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
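The breakdown above can be expressed as a short Python sketch. It assumes the read-name layout shown here (first colon-separated field, then lane) and assumes that the ID uses the first five characters of that first field, which is inferred only from the single example in this article. The function name and the `id_prefix_length` parameter are hypothetical, and other sequence providers may need different rules.

```python
def derive_id_and_pu(read_name, id_prefix_length=5):
    """Derive (ID, PU) from a read name such as
    'H0164ALXX140820:2:1101:10003:23460'.

    id_prefix_length is an assumption inferred from this article's example,
    where the ID is 'H0164.2' and the PU is 'H0164ALXX140820.2'.
    """
    fields = read_name.split(":")
    if len(fields) < 2:
        raise ValueError("expected at least '<prefix>:<lane>' in the read name")
    prefix_field, lane = fields[0], fields[1]
    rg_id = f"{prefix_field[:id_prefix_length]}.{lane}"  # flow cell + lane
    rg_pu = f"{prefix_field}.{lane}"                     # full prefix + lane
    return rg_id, rg_pu


for name in ("H0164ALXX140820:2:1101:10003:23460",
             "H0164ALXX140820:2:1101:15118:25288"):
    print(derive_id_and_pu(name))
```

Both read names yield the same ID and PU, matching the @RG line shown earlier, because they differ only in tile and cluster coordinates.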

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 BAM files, with the following @RG fields in the header:

Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400

Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400

Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400

Note the hierarchical relationship from read groups (one per lane, each with a unique ID) to libraries (each sequenced on two lanes) to samples (each spanning four lanes, two lanes per library).
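To make that hierarchy concrete, here is a small Python sketch that rebuilds the twelve read groups of the trio example and nests them as sample, then library, then read group ID. The helper names and the layout table are hypothetical conveniences; the field values are the ones listed in the example above.

```python
from collections import defaultdict


def build_trio_read_groups():
    """Recreate the 12 read groups of the trio example as dicts."""
    # (sample, flowcell, first lane used by that sample)
    layout = [("DAD", "FLOWCELL1", 1), ("MOM", "FLOWCELL1", 5), ("KID", "FLOWCELL2", 1)]
    read_groups = []
    for sample, flowcell, first_lane in layout:
        lane = first_lane
        # Library 1 has 200 bp inserts, library 2 has 400 bp inserts.
        for lib_number, insert_size in ((1, "200"), (2, "400")):
            for _ in range(2):  # each library is run on two lanes
                read_groups.append({
                    "ID": f"{flowcell}.LANE{lane}",
                    "PL": "ILLUMINA",
                    "LB": f"LIB-{sample}-{lib_number}",
                    "SM": sample,
                    "PI": insert_size,
                })
                lane += 1
    return read_groups


def build_hierarchy(read_groups):
    """Nest read group IDs as sample -> library -> [IDs]."""
    tree = defaultdict(lambda: defaultdict(list))
    for rg in read_groups:
        tree[rg["SM"]][rg["LB"]].append(rg["ID"])
    return tree


trio = build_trio_read_groups()
for sample, libraries in build_hierarchy(trio).items():
    for library, ids in libraries.items():
        print(sample, library, ids)
```

Each sample ends up with two libraries and each library with two read group IDs, which is the structure that lets tools treat lanes, libraries and samples at the appropriate level.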

Updated on 2019-08-23

From santiagorevale on 2018-05-31

Hi there,

I hope you can help me unravel this, because I have always found the definitions and examples for ID and PU confusing.

In the beginning of the document it says

> In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

But then, in the middle we have

> ID = Read group identifier

> This tag identifies which read group each read belongs to, so each read group’s ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.

According to the previous paragraph, two samples multiplexed in the same lane will have the same RG ID: `flowcell + lane`. Ergo, it won’t be uniquely identifying each set of reads.

On the other hand, we have the definition of PU as follows:

> PU = Platform Unit

> The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier.

According to this definition, PU is described just as how ID was expected to be working: uniquely identifying each set of reads, `flowcell + lane + barcode`.

Besides, the name of the RG `Platform Unit` sounds like it should be here where `flowcell + lane` should be specified, because it’s actually referencing what a unit is for Illumina.

So, shouldn’t these two definitions be swapped, fixed or amended?

Finally, regarding the `Multi-sample and multiplexed example`, could you extend the example so that it contains several samples multiplexed together and sequenced in several lanes? For example:

```
Flowcell 1, Lanes 1-6
Dad 200 bp (barcode 1)
Mom 200 bp (barcode 2)
Kid 200 bp (barcode 3)

Flowcell 1, Lanes 7-8
Dad 400 bp (barcode 4)
Mom 400 bp (barcode 5)
Kid 400 bp (barcode 6)

Flowcell 2, Lanes 1-4
Dad 400 bp (barcode 4)
Mom 400 bp (barcode 5)
Kid 400 bp (barcode 6)
```

In this example, barcodes 1, 2 and 3 are pooled together, and this same pool is then sequenced across 6 lanes. The same is done with barcodes 4, 5 and 6. I believe an example like this would really help me understand how this is intended to look and what it is that I’m missing or overlooking.

Thank you very much in advance for your help.

Best regards,

Santiago

From Sheila on 2018-06-10

@santiagorevale

Hi Santiago,

>According to the previous paragraph, two samples multiplexed in the same lane will have the same RG ID: flowcell + lane. Ergo, it won’t be uniquely identifying each set of reads.

In this case, the SM tag will distinguish between the two different samples.

The only extra information the PU field may contain is the barcode or index in a multiplexed run. The PU field is not required by GATK. It is true some information is redundant.

>Finally, regarding the Multi-sample and multiplexed example, could you extend the example so that it contains several samples multiplexed together and sequenced in several lanes?

Can you post what you think would be the correct way to label the read group? :smile:

Thanks,

Sheila

From Wubalem on 2019-07-18

Hi there,

I am new to the platform and working on WGS data sequenced using MGISEQ2000. I understand from a colleague of mine that, in order to have a smooth downstream analysis using GATK3.7, the PL should remain as ‘\tPL:Illumina’ without specifying the type of sequencer used when assigning read group fields. So my question is: since I am using GATK4, do I need to specify the platform name as ‘\tPL:MGISEQ2000’ or follow the suggestion given by my colleague?

Thanks,
