Read groups

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2015-11-20

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI

Meaning of the read group fields required by GATK

ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world. Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.
PU = Platform Unit The PU holds three types of information, the {FLOWCELLBARCODE}.{LANE}.{SAMPLEBARCODE}. The {FLOWCELLBARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLEBARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.
SM = Sample The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. Note, when we say pools, we mean samples that are not individually barcoded. In the case of multiplexing (often confused with pooling) where you know which reads come from each sample and you have simply run the samples together in one lane, you can keep the SM tag as the sample name and not the "pooled name".
PL = Platform/technology used to produce the read This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
LB = DNA preparation library identifier MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.

Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster.

H0164ALXX140820:2:1101:10003:23460 H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell _____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run _______________:2 #portion of @RG ID and PU fields indicating flow cell lane

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

Updated on 2018-05-03

From LuisaB on 2015-12-04

Hi,

sorry, but this is still not clear to me.

My situation is the following: I have 475 individuals, sequenced in a unique run of a single instrument (same flowcell). On each lane of this run, 95 individuals were multiplexed (so I used 5 lanes). Each batch of 95 individuals all belong to the same library, so to be clear I have 5 libraries, each containing 95 individuals and each library was sequenced on a different lane of the same run (same flowcell).

I added the read group information as follows.

For example, for individual F008_01 belonging to library 1 and sequenced on lane 4:

RGID=lane4

RGPU=lane4

RGSM=ind_F008_01

RGPL=illumina

RGLB=lib1

For individual F033_01 belonging to library 2 and sequenced on lane 5:

RGID=lane5

RGPU=lane5

RGSM=ind_F033_01

RGPL=illumina

RGLB=lib2

Is this correct? Should I add also the name of the individual in ID or PU or is there a single read group for each lane, independently of how many individuals I have on a lane?

The header of the reads in the fastq file looks like this for F008_01 (lane 4):

HWI-ST0747:467:C59MAACXX:4:1101:1979:2080 1:N:0: and like this for F033_01 (lane 5):

HWI-ST0747:467:C59MAACXX:5:1101:16528:2235 1:N:0:

and the part before the lane information is the same for all lanes (@HWI-ST0747:467:C59MAACXX).

Therefore, does it matter if I don’t specify it neither in RGID nor in RGPU? It’s like if all this was replaced by the word “lane” in the RG information I used. Is it clear what I mean?

If the way I used is not correct, can you please tell me how I should specify the RG information in the correct way?

Thank you very much,

Luisa

From LuisaB on 2015-12-18

Hi,

any news about my question?

Thanks,

Luisa

From Geraldine_VdAuwera on 2015-12-19

Sorry, this sort of question is assigned a low priority because we try to help people with blocking issues first, or issues that are specific to GATK. We’ll make sure someone responds on Monday.

From Geraldine_VdAuwera on 2015-12-21

Hi Luisa,

Most of our team has gone away for the holiday break and I’m not the most knowledgeable about read groups, but here’s my interpretation. If we look at the first example in the doc, we see the difference between ID and PU:

@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 […]

If you consider that the core part of the read name is composed of three segments, you see that ID just contains the first and last segments, whereas PU contains all of it. That’s because the middle segment is the barcode information that is specific to each multiplexed sample (for which at this point you should have separate files). In you case I think this would be broken down like this:

@RG ID:C59MA.4 PL:illumina PU: C59MAACXX.4 […]

Within a single sample’s file this looks redundant but if you were to combine these into multi-sample bams they would provide a useful differentiation per-sample for the purpose of base recalibration.

From LuisaB on 2016-01-04

Hi @Geraldine_VdAuwera,

thank you for your answer.

If I understood it correctly, what you mean is that the part that is present in PU but not in ID should be different in different samples, so PU contains more information than ID and this helps if I have multi-sample bam files.

But the point is that in my case this part HWI-ST0747:467:C59MAACXX is the same for all the samples on all the lanes I use. Only the number after this part changes. For example, in the read name I have HWI-ST0747:467:C59MAACXX:4 if a sample was sequenced on lane 4, @HWI-ST0747:467:C59MAACXX:5 if a sample (the same sample or a different one) was sequenced on lane 5.

So, I don’t add any information adding a part that is the same for all my samples.

Or have I misunderstood something?

Thanks again,

Luisa

From Geraldine_VdAuwera on 2016-01-06

Hi Luisa,

You mentioned that 95 individuals were multiplexed on your 5 lanes. How do you distinguish the 95 individuals? Or are they pooled without separate barcoding?

From LuisaB on 2016-01-06

Hi @Geraldine_VdAuwera,

yes, 95 individuals were multiplexed on each of the 5 lanes. I have single-end RAD-seq data for these individuals (sorry, I think I didn’t specify this before).

I can distinguish the 95 individuals because before the demultiplexing step, each read starts with a barcode, unique for each individual of the lane. It is an inline barcode, so it doesn’t appear in the read name.

Let me know if you need additional information.

Thanks,

Luisa

From Geraldine_VdAuwera on 2016-01-08

I see — then I would recommend using that barcode or something equivalent in place of the code that would normally be included in the read name.

From jfb on 2016-01-11

Hi,

I found your back-and-forth with Luisa helpful to my confusion about how to properly assign read groups to multiplexed samples on a nextseq (i.e., multiple samples are pooled into one library that are then spread across four lanes), but I still can’t tell if you’re suggesting to add the sample barcode into the ID field or keep it only in PU. The example with MOM,DAD,KID is nice but it might be even more helpful if it had some cases where there were multiple samples per lane. Also isn’t the read definition you are using for determining PU vs ID a bit dated? The nextseq header I think is:

@instrument:run_number:flowcell_ID:lane:tile:x-pos:y-pos read:is_filtered:control_number:index_sequence

In my case libraries from tumor and normal from patient A are pooleed and loaded onto a nextseq. The tumorA fastq headers for lane 1are:

NS500126:394:HL5H3BGXX:1:11101:12343:1053 1:N:0:AACCAG and for normalA:

NS500126:394:HL5H3BGXX:1:11101:6442:1053 1:N:0:CACAGT

So I think I should define

ID=HL5H3BGXX.1 and PU=HL5H3BGXX.AACCAG.1and SM=tumorA

ID=HL5H3BGXX.1 and PU=HL5H3BGXX.CACAGT.1 and SM=normalA

Do I have this right?

Thanks, John

P.S. I was reading your other helpful post and trying to think ahead about what happens to these read groups when bams from the same sample (but different lanes) are merged prior to bqsr and indel realignment and then (as in my case of tumor/normal pairs) merged again. It seems like the first merge is fine but wont the RG:IDs clash during the tumor:normal merge?

From shlee on 2016-01-11

Hi John,

You CANNOT use your RG ID definitions. The @RG ID information field is the lowest denominator that distinguishes between samples and factors that contribute to technical artifacts, e.g. different flow cell lanes. The RG ID’s field is applied to every read with the RG tag. For your definition, the ID fields are identical for your tumor and normal samples and when the files are merged for joint analysis, our tools are then unable to distinguish reads from the two different files.

Here are my general recommendations:

(1) If the

RG header information assigned by your sequencing provider allows you to analyze your datasets, then great. Don't mess with it. (2) If not, use Picard's [AddOrReplaceReadGroups](http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups) to replace your read group information fields so the

RG ID is unique per sample per lane.

Some minor additional comments:

- The @RG ID field is important up to HaplotypeCaller. After a VCF is produced, samples are distinguished by SM (sample name).

- Because the RG tag is repeated for each read, the length of the tag can contribute to file size, e.g. SAM format files. If the files are compressed, e.g. BAM format, then this length should not matter as much.

Thanks for the nextseq header information. I’m new to the team and this is the first time I’ve seen such a format. So I have a question for you in turn. I’m curious if each read name then also contains the read:is_filtered:control_number:index_sequence information. Or if each read queryname contains at least the index_sequence information. If you can clarify, we can keep this in mind for future documentation.

Thanks, and I hope this clarifies your question.

-Soo Hee

From jfb on 2016-01-12

HI Soo Hee,

Thanks for your help. So it sounds like I should instead define:

ID=HL5H3BGXX.AACCAG.1and SM=tumorA

ID=HL5H3BGXX.CACAGT.1 and SM=normalA

and not worry about PU. I suppose for completeness I could put the machine and run info there:

PU=NS500126.394

but it’s probably not necessary.

I’m glad you mentioned that the length of the ID tag doesn’t matter for BAM files…I was regretting defining such a long name.

If I understand your question correctly, we parse the fastqs by sample barcode so the header I showed is the same for every read in the file (except of course for the tile:x-pos:y-pos fields). I believe this type of header is generic for Illumina since at least casava 1.8:

but I suppose the format is probably customizable.

From shlee on 2016-01-12

John,

Your new RGID fields look good in that they now differentiate your samples.

However, RECONSIDER your RGPU field as it has implications for BaseRecalibrator.

Setting all your PU fields identical across samples has implications that you want to avoid. I recommend leaving the PU field at the original unique values (`PU=HL5H3BGXX.AACCAG.1` & `PU=HL5H3BGXX.CACAGT.1`) over removing the PU field. This seems redundant given our tools also require unique RGID fields but is the best option for data processing and record keeping.

The implications you want to avoid have to do with BaseRecalibrator. BaseRecalibrator uses the PU field if present and the PU field takes precedence over RGID as [this user’s post](http://gatkforums.broadinstitute.org/gatk/discussion/4586/read-group-pu-field-used-by-baserecalibrator) illustrates.

Specifically, [BaseRecalibrator](https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php) uses four covariates in determining error models that it then uses to adjust base quality scores. One of these is the RGID or RGPU if present.

Typical approach

We multiplex samples across lanes and use the lane as a covariate in BaseRecalibrator. For example, my DNA, split across different lanes (marked by different RGIDs), would be aggregated and processed as one file through BaseRecalibrator with lane as a covariate. Your DNA would also be aggregated from across lanes and processed as one through BaseRecalibrator. In this approach, each sample is processed separately and is implicitly a covariate.

Your approach

All your samples were sequenced in the same flow cell lane, and so you are unconcerned with lane level covariation. You should still be concerned with sample level covariation. For example, what if one sample was degraded? Would you want to propagate the degraded sample’s error models across all your samples?

- If your RGPU is identical across your samples, then you effectively remove one of the covariates (sample) from consideration in the error models.

- If your RGPU differentiates your samples, then BaseRecalibrator can consider samples (defined by RGPU) as a covariate.

- If RGPU is absent, then BaseRecalibrator uses RGID (which already define samples per lane in your case) as a covariate.

Apologies for the length of this response. Also, thank you for the CASAVA link. Please let me know if anything is unclear.

-Soo Hee

From jfb on 2016-01-13

OK thanks. I’ll update my pipeline accordingly.

From LuisaB on 2016-01-18

Hi @shlee,

thank you for your explanations.

I still have a couple of questions about this topic though. As I wrote here above, I have single-end RAD-seq data for 475 individuals, sequenced in a unique run of a single instrument (same flowcell). On each lane of this flowcell, 95 individuals were multiplexed (so I used 5 lanes). In each batch, all 95 individuals were multiplexed in a single library. So, I have 5 libraries, each containing 95 individuals, and each library was sequenced on a single lane of the flowcell.

Among these samples, I have one sample that was multiplexed in two different libraries and therefore sequenced on two different lanes (as a kind of replicate). The reads belonging to this sample are identified by a barcode for the first replicate (sequenced on lane 5), and by a different barcode for the other replicate (sequenced on lane 8). So, I would write this for the first replicate:

RGID=C59MAACXX.TCGGCAGTCGTGCAG.5

RGPU=C59MAACXX.TCGGCAGTCGTGCAG.5

RGSM=F039_05

RGLB=C59MAACXX.TCGGCAGTCGTGCAG.5

RGPL=illumina

and this for the second replicate:

RGID=C59MAACXX.AAGACCAATCTGCAG.8

RGPU=C59MAACXX.AAGACCAATCTGCAG.8

RGSM=F039_05

RGLB=C59MAACXX.AAGACCAATCTGCAG.8

RGPL=illumina

Is this ok, or the fact that I have two different barcodes for the same sample create problems? May I use the name of the individual (F039_05), instead of the barcode? I mean something like this for the first replicate:

RGID=C59MAACXX.F039_05.5

RGPU=C59MAACXX.F039_05.5

RGSM=F039_05

RGLB=C59MAACXX.F039_05.5

RGPL=illumina

and this for the second replicate:

RGID=C59MAACXX.F039_05.8

RGPU=C59MAACXX.F039_05.8

RGSM=F039_05

RGLB=C59MAACXX.F039_05.8

RGPL=illumina

I’m asking this also because the barcode does not appear in my read headers, that look like this:

@HWI-ST0747:467:C59MAACXX:5:1101:16528:2235 1:N:0:

so I am not sure it is ok to use it in the RG information.

In addition to this, the same barcodes are used in different lanes to multiplex different individuals (e.g. barcode CTAACACGGCTGCAG is used for sample F008_02 in lane 4, and for sample F033_02 in lane 5).

One last question: regarding RGLB, is it ok to write the same value used for ID and PU, or should RGLB be the same for the 95 individuals multiplexed in the same library? e.g. something like C59MAACXX.5 for all the individuals multiplexed in the library sequenced on lane 5?

Sorry for the long post and thank you for your help.

Please let me know if something I wrote is unclear.

Luisa

From shlee on 2016-01-20

Hi @LuisaB,

First question about how to label your single technical/biological replicate

Let me ask you a question in turn. What is your reason for including this technical/biological replicate in your experimental dataset? What are you trying to control for? Depending on your answer to this question, I would conjecture how you use the RG fields would differ. The only thing I’d like to point out is that the RGSM field is what distinguishes samples in the VCF, and your two technical/biological replicates, by virtue of their identical RGSMs, would become indistinguishable. I’m going to guess, based on my own unconfirmed assumptions, that for your aims this is undesirable. If this is so, we’d suggest adding `.5` and `.8` to your replicate sample names to distinguish them.

Second question about RGLB usage

The dictionary entry above states:

> MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

This means that MarkDuplicates is a tool that is aware of the RGLB field.

However, my limited knowledge of RAD-Seq has me remembering it is more like a targeted amplification library where you’d expect mostly duplicates, even more so given the single ended nature of your reads. In this situation, marking duplicates isn’t so relevant. So in this context, I would fathom how you want to use the RGLB field is up to you. For other downstream implications of the RGLB designations, I have to defer to Geraldine_VdAuwera and Sheila.

Until one of them can get back to you or I can check with them, I will also refer you to two posts : (1) [an overview of lane, library, sample and cohort](http://gatkforums.broadinstitute.org/firecloud/discussion/3059/lane-library-sample-and-cohort-what-do-they-mean-and-why-are-they-important) and (2) [a discussion of how MarkDuplicates handles library information](http://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information).

I hope this is helpful.

-Soo Hee

From Geraldine_VdAuwera on 2016-01-20

@shlee is correct that the RGLB is mostly used for marking duplicates and related QC operations (eg estimating library complexity). Downstream GATK tools do not use RGLB. BaseRecalibrator uses RGID or RGPU if present; all other GATK tools only use RGSM.

Regarding technical replicates, how to treat them depends on the intended purpose of the replication, as @shlee also correctly pointed out. Some people use technical replicates to evaluate statistical properties of their analysis; for that purpose you should label them with different RGSMs. Others use them to accumulate more data per sample; for that purpose you would use the same RGSMs. It’s up to you and your experimental design.

From LuisaB on 2016-01-22

Hi shlee and Geraldine_VdAuwera,

thank you for your replies.

Regarding RGLB, you are correct, I cannot use MarkDuplicates with my data. So I will set RGLB=null for all my samples.

Regarding the replicates, I will not use them as “real” replicates. I had an empty spot and I decided to sequence this individual twice. So I will keep the same value for RGSM, because I want them to be identified as single individual during the SNP calling. But still, I think it is important that BQSR manages to distinguish them, since they were sequenced on different lanes and therefore they will show a different error model. So, what I wanted to know in my previous post is if it is correct to include the barcodes in RGID and RGPU, also if two different barcodes identify the same individual in the two lanes, or if I should rather use directly the name of the individual (F039_05 – as specified in my previous post):

RGID=C59MAACXX.TCGGCAGTCGTGCAG.5

RGPU=C59MAACXX.TCGGCAGTCGTGCAG.5

RGSM=F039_05

and

RGID=C59MAACXX.AAGACCAATCTGCAG.8

RGPU=C59MAACXX.AAGACCAATCTGCAG.8

RGSM=F039_05

or rather:

RGID=C59MAACXX.F039_05.5

RGPU=C59MAACXX.F039_05.5

RGSM=F039_05

and

RGID=C59MAACXX.F039_05.8

RGPU=C59MAACXX.F039_05.8

RGSM=F039_05

also because, as I wrote before, the barcode does not appear in my read headers, and the same barcodes are used in different lanes to multiplex different individuals (e.g. barcode CTAACACGGCTGCAG is used for sample F008_02 in lane 4, and for sample F033_02 in lane 5).

So maybe it would be easier to just use the names of the individuals also in ID and PU (e.g. C59MAACXX.F039_05.5) rather than the barcodes (e.g. C59MAACXX.TCGGCAGTCGTGCAG.5). Is it possible to do this or do I really need the barcodes?

Thanks again,

Luisa

From Geraldine_VdAuwera on 2016-01-22

I wouldn’t recommend using “null” for any required field. That is a reserved word in some programs and may be misinterpreted as a missing value by some. Call it Charlie if you want, just not null or unknown or special words like that.

Ultimately you can use whatever you want in the RG fields as long as they disambiguate the origin of the reads appropriately. We just provide guidelines based on the conventions we use but there are few to no absolute naming rules.

From LuisaB on 2016-01-22

Ok, I will not use “null”.

Thanks,

Luisa

From myprogramming2016 on 2016-03-24

Hi,

I need some help for adding read group to sorted bam files.

I have more than 100 sorted bam files. I would like to use picard tools to add read group.

Could you please share a command or script top add read group in one go?

Thanks

From Sheila on 2016-03-25

@myprogramming2016

Hi,

Hopefully someone will share on here. We do not provide such scripts. You can also ask on some other forums as well.

-Sheila

From benjaminpelissie on 2016-05-10

Hi,

This is a very helpful thread, but I still encounter some doubts about my analyses.

I have 88 samples, paired-end sequenced on Illumina Hiseq, that were multiplexed on the 8 lanes of a same run (i.e., in the same flowcell, giving 10 samples per lane, with tags re-used in different lanes). Here is the first line of the fastq files for sample CPBWGS_02:

HWI-D00256:413:C7N5GANXX:1:1101:1112:2084 1:N:0:ATTACTCGATAGAGGC

HWI-D00256:413:C7N5GANXX:1:1101:1112:2084 2:N:0:ATTACTCGATAGAGGC

Among those 88 samples, 10 were re-sequenced (due to quality issues) in 1 lane of another run (i.e., in another flowcell), using the same multiplexing tags as used in the first run. Here is the first line of the fastq file for sample CPBWGS_02_v2 (which is a re-sequencing of CPBWGS_02):

HWI-D00256:427:HCHMJBCXX:1:1101:1224:2149 1:N:0:ATTACTCGATAGAGGC

HWI-D00256:427:HCHMJBCXX:1:1101:1224:2149 2:N:0:ATTACTCGATAGAGGC

My understanding is that 1. I have to input the tag name in ID (rather than just cell and lane infos) in order to deal with the multiplexing, and 2. I should structure my data per lane in SM, in order to allow an accurate detection of duplicates down the road. Am I right?

Currently, I am defining:

- for CPBWGS_02:

ID=C7N5GANXX.ATTACTCGATAGAGGC.1

SM=CPBWGS_02

LB=C7N5GANXX.1

- for CPBWGS_02_v2:

ID=HCHMJBCXX.ATTACTCGATAGAGGC.1

SM=CPBWGS_02 # I want those reads to be pooled for variant calling

LB=HCHMJBCXX.1

Would that be ok?

As for PU, I was thinking about not defining it (since it would be entirely redundant with ID for all my samples), but @shlee replied to John that it would be problematic for his analyses. Is it relevant in my case?

Many thanks,

Ben

From shlee on 2016-05-17

Hi @benjaminpelissie. We haven’t forgotten you and will get back to you sometime soon about your question.

From benjaminpelissie on 2016-05-18

Thanks @shlee !

From shlee on 2016-05-18

Hi @benjaminpelissie,

I hope I can help you with your questions. Others from my team may jump in with thoughts.

Before delving into your questions, a preamble:

The term multiplexing could be used at two levels. (A) The risk-spreading type, e.g. multiplexing 8 samples into a single library that is run across 8 different lanes. And (B) the cost-saving type, e.g. 8 samples run in a single lane. Both cases require we barcode each sample so that we can identify which reads belong to which samples. When we demultiplex or demux, we use the barcodes to separate out each sample's reads into individual files.

As shown in the trio example, a sample can have multiple libraries, e.g. SM=Dad has 200bp insert and 400bp insert libraries (2 different LB labels). So SM>LB>ID or SM>LB>ID>PU for BQSR. Our tools take the lowest denominator read group component into consideration, typically ID. MarkDuplicates additionally factors for LB (to flag molecular duplicates) and BQSR takes PU into consideration, if present, over ID.

Now some responses to your questions:

From your description--11 samples per lane for 88 samples--it appears option (B) applies to your scenario. Let me ask, what was the reason you had to re-sequence a lane's worth of samples? Were these samples all from the same lane in the initial run? Meaning, are their quality issues associated with the lane they shared or with the samples themselves?

Let me ask another question. Did you reload the same library prep of the samples as the first run for the resequencing or did you remake the sample libraries using the same barcodes? Depending on this and assuming you will still analyze your "bad" data (because you say "I want those reads to be pooled for variant calling"), how you label LB will differ and has impact on marking duplicates. If they come from the same library prep (the same tube you handed to the sequencing center) and assuming you are salvaging the bad lane of data along side the resequenced data, then you'd want to make sure LBs are identical for these Jekyll-and-Hyde sets. This is because MarkDuplicates will consider molecular duplicates for identical LBs despite differing IDs. Optical duplicates are detected per ID group. If they did not come from the same prep (you remade your libraries), then the LB fields should be different.

Yes, you want to differentiate your samples in the same lane. So using the barcode tag as part of the ID is one means to this end. Sample (SM) should refer to the individual you are sequencing, e.g. dad, for whom you want variant calls to be represented in a single column in the VCF. The SM field is not a factor in duplicate flagging as I mentioned above.

for CPBWGS_02: ID=C7N5GANXX.ATTACTCGATAGAGGC.1 SM=CPBWGS_02 LB=C7N5GANXX.1 for CPBWGS_02_v2: ID=HCHMJBCXX.ATTACTCGATAGAGGC.1 SM=CPBWGS_02 # I want those reads to be pooled for variant calling LB=HCHMJBCXX.1

If the library was identical, change your LB fields to be identical. Otherwise, keep them different. Yes, keep your SM fields identical as is. Your ID fields are already different--they distinguish your Jekyll-and-Hyde runs. So this is good.

I don't remember all the thought that went into John's case. What is clear is that your cases are different. It should be fine for you to omit PU and stick with your IDs.

I'd like to add that, the way I understand BQSR, be sure to provide it all the data from a single lane and run it separately for each lane. This means you'll be running 9 instances of BQSR and your Jekyll-and-Hyde sets are run separately. This approach catches and helps to alleviate lane-level issues that can seriously impact the quality of data, such as those caused by under or overloading a lane.

This got really long. Again, I hope this is helpful. Let us know if anything is unclear or if you have more questions.

From benjaminpelissie on 2016-05-24

Thank you @shlee, that really helps. Because our sequencing center should have just reloaded the same library preps to resequence those 10 individuals, I guess should have given my Jekyll-and-Hyde samples the same LB value. However, I am now at a step where I mapped all my samples to a reference and was about to mark duplicates. Can I just run AddOrReplaceReadGroups on the mapped bam files before marking duplicates, or should I start over and re-create uBAM from scratch?

From shlee on 2016-05-24

@benjaminpelissie You can still change the LB labels as you describe (run AddOrReplaceReadGroups on the mapped bam), since MarkDuplicates is the only tool that should care about it. No need to start from scratch. That being said, it occurs to me you have an opportunity to see what are some differences between your Jekyll-and-Hyde sets, at least at the duplicate and library complexity level by running independent MarkDuplicates runs by providing the files separately to MarkDuplicates. After this, you can flag duplicates again for the LB grouping by feeding both Jekyll and Hyde files into a MarkDuplicates run. You or your sequencing center probably have already done or considered such things.

From benjaminpelissie on 2016-05-24

This is a good news! Thank you @shlee. Yes I did consider to mark duplicates at the lane then at the sample (SM) level, although I am realizing that I wouldn’t have been able to do it with different LBs for my Jekyll-and-Hyde sets… so thanks again!

From benjaminpelissie on 2016-05-25

I noted that after modifying RGLB and adding RGPU to my mapped bam, the output files created by AddOrReplaceReadGroups were exactly the same size as input files OR exactly 4 bytes heavier (5 cases out of 11). I don’t know what it means but this is clearly a pattern.

From Geraldine_VdAuwera on 2016-05-25

Is it a pattern that you’re worried about? That sounds like just a harmless artifact of compression.

From RDW on 2016-05-25

> @Geraldine_VdAuwera said:

> – `PU` = Platform Unit

> The `PU` holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the `PU` is not required by GATK but takes precedence over `ID` for base recalibration if it is present.

I was a bit puzzled by this explanation, which seems to contradict what I’ve read elsewhere about the definition of the PU field. My impression was that this is intended to differentiate (in the case of Illumina data) between flowcell lanes and not to say anything about the sample or library, which are already described in the SM and LB tags. It would thus be in a form like {FLOWCELL_BARCODE}.{LANE} and not {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. If this interpretation is correct, adding {SAMPLE_BARCODE} to PU would actually make it too specific, since there’s no longer a simple way for downstream tools to (e.g.) operate at the flowcell lane level regardless of the sample or library. In fact, doesn’t BQSR operate per lane, and might not this be the reason why it prefers PU (when available) to the RG ID?

According to the SAM/BAM spec:

“PU: Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifer.”

which (as the name ‘platform unit’ suggests) does not mention sample or library. My interpretation of ‘Unique identifier’ in this context would be that it refers to the uniquely identified flowcell lane, and not to what is run on it.

Is this interpretation just wrong, or is the explanation above not quite correct, or does GATK use the PU tag in a non-standard way? Apologies if I’m completely off-base here, or missing something obvious!

Example – I have a data set with multiple samples, each with a single library, multiplexed within a lane but in some cases also run on multiple lanes to increase the depth of coverage. Armed only with the SAM/BAM spec I would probably have constructed the RG ID for each unique combination of sample, fllowcell and lane in the form {SAMPLE}.{FLOWCELL}.{LANE}, the corresponding PU tag as {FLOWCELL}.{LANE}, the SM tag as {SAMPLE} and the LB as {SAMPLE}.lib. My reading of Geraldine’s response to Luisa suggests I may have the contents of the ID and PU tags the wrong way around, but Soo Hee’s response to John (and my reading of the SAM spec) suggests that the RG ID must indeed be unique and take into account the identity of the sample, while (as above) it isn’t obvious to me that the sample belongs in the PU tag. As far as GATK is concerned this may be a moot point once the samples have been demultiplexed into individual files, but I’d rather use all tags in a standard way if possible to avoid any potential issues with third party tools.

Grateful for any comments!

Richard.

From Geraldine_VdAuwera on 2016-05-26

Hi Richard @RDW,

I should clarify that the definition of PU given in this article is in fact a fairly narrow definition that corresponds specifically to our usage of PU in our pre-processing pipeline. You’re correct that the definition given in the spec is more general and could be interpreted quite differently. I believe it was left intentionally vague because to leave some flexibility in analysis design and platform compatibility.

From RDW on 2016-05-26

Hi Geraldine,

Thanks for your quick response!

To me, it seems like this ‘narrow’ definition of PU essentially re-purposes it for a slightly different use, a side effect of which is that it can no longer be used as a straightforward lane identifier, i.e., including a sample identifier in PU seems redundant and makes it less useful for some purposes. It’s kind of a shame that PU is not both mandatory and restricted to {flowcell}.{lane}, which would avoid the issue described in the Biostars post below, discussing an Appistry support query on lane identification:

https://www.biostars.org/p/90317/#92174

But perhaps there is a good reason for defining PU this way, e.g. to create an identifier that can’t be re-written when merging BAMs, etc.

That aside, I’d like to make absolutely sure I’m using the tags in the way that GATK actually expects, so in my multiplex example above, can you confirm that it is OK to set both ID and PU to {SAMPLE}.{FLOWCELL}.{LANE}?

From benjaminpelissie on 2016-05-26

Hi Geraldine. No I didn’t worry especially about this pattern, though didn’t think of the compression either… Thanks!

From shlee on 2016-05-26

@RDW When MarkDuplicates merges BAMs with identical RGIDs, it check the other RG fields, e.g. LB, and if there are differences, it appends a `.1` to the identical RGID to differentiate (`.2` for the third file etc). So I wouldn’t say that RGID is re-written when merging. Rather, the RGID gets appended. This updated RGID is perpetuated to each read record’s RG tag.

Here I’m going out on a limb, and someone may correct me if I’m off-track, but I could imagine some poor sap with not enough data per lane to run BQSR. Their samples were split too many ways across multiple lanes. Let’s say theirs was two of many samples per lane, and they ran four lanes of the same two samples (2×4=8 RGIDs). In this case, could this person sacrifice individual sample identities and run four PUs (each representing one of the four lanes) through one instance of BQSR? Just thinking out loud. If the alternative is to forgo BQSR, what is preferable?

From RDW on 2016-05-26

Thanks for the clarification on the MarkDuplicates behaviour. Regarding multiplexing, I may be the poor sap here – I have some data with a dozen samples multiplexed per lane, but with samples run on 4 different lanes – i.e., 12 read groups per lane, 4 read groups per sample (don’t ask!). How would you suggest I (a) write the ID and PU fields for this type of data and (b) perform BQSR? The discussion in the Biostars link suggests that Appistry have recommended running lane-level BQSR of multiplexed samples in some situations, but it seems I can’t just use the PU tag to identify a lane (or can I?).

From shlee on 2016-05-27

I have issues with some of the statements made on the Biostars post you’ve shared. Instead of responding to those statements, let me instead clarify that BQSR controls for lane level errors that are specific to each sequencing run. Another important thing to remember is that BQSR needs enough data to perform well. [The original BQSR document](http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr) says the following:

````

- The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

- A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

````

[This thread](http://gatkforums.broadinstitute.org/wdl/discussion/comment/14269) adds to this that the minimum of 100M total bases refers to targeted bases x coverage and should exclude duplicates. There are of course other read filters that [BaseRecalibrator](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php) uses but typically duplicates would have the most impact.

I think you are asking whether you should run (A) 12 instances of BQSR per sample (4 RGIDs per BQSR run), (B) four instances of BQSR per lane (12 RGIDs per BQSR run) or © one instance of BQSR per aggregated lane (4 PUs per BQSR run). We recommend (A) or (B). I think the hypothetical scenerio © may be better than nothing but is less preferable than the other two options. I think typical runs these days, especially for variant calling, should allow you enough data to run either scenario (A) or (B). Is this not the case for you? For both (A) and (B), if PU is present, it should distinguish the ID groupings. Again, to reiterate, if PU is present, BQSR will use PU over ID as the sample-level covariate. How you fill these fields is up to you.

From Sumudu on 2016-09-22

Hi,

Thank you for all the informative content here.

I just want to make sure that my understanding for defining RGLB is correct. 8 individual DNA samples were pooled together and sequenced in a single lane in a single run on illumina MiSeq.

In this case a single LB name can be used for all 8 samples. Below is the RG tags I used. Fastq file header for sample L23334_S2 is: M04362:21:000000000-AU567:1:1101:13786:1609 1:N:0:2

RGID : 000000000-AU567.2.1

RGPU : 000000000-AU567.2.1

SM : L23334_S2

LB : Leish_lib

Please confirm whether this is correct.

Next question is, if I prepared two sets each with 4 multiplexed samples (from the 8 samples I have) & run on two different lanes, are these considered as 2 different libraries?

What if those two sets run on same lane? Should they be considered same library or as two different?

Appreciate your clarification.

Thank you.

Sumudu.

From Sheila on 2016-09-30

@Sumudu

Hi Sumudu,

Sorry for the late response. Can you confirm that the individual samples are not barcoded so you can tell the difference in individuals?

Thanks,

Sheila

From Sumudu on 2016-10-01

@Sheila

Hi,

For reads of each sample a unique combination of Nextera index sequences (I7 and I5) were added before pooling for sequencing. After sequencing I received de-multiplexed samples, for each R1 and R2 separate datasets.

I hope this is the information you are asking.

Thanks

Sumudu

From Sheila on 2016-10-05

@Sumudu

Hi Sumudu,

I have some general recommendations for you.

1) The different insert (500 bp insert and 700 bp insert) libraries from the same sample should have different libraries.

2) Library should reflect sample prep, not lane runs. Lane run differences should be reflected in the RGID.

I hope this helps.

-Sheila

From Sumudu on 2016-10-06

@Sheila

Thank you Sheila. Its clear now.

I have done 300bp insert 8 different sample preps. Run them on same lane in a flowcell after pooling. They should be given unique RGLB tags as well.

Regards

Sumudu.

From Jutta on 2016-11-11

Hey there,

I followed this discussion, however I couldn’t answer all of my questions I have concerning how to determine the read group informations for Illumina RNA-seq data. And I hope you can give me some help!

I have in total 180 RNA-seq samples, of which 20 samples each were pooled and distributed on 9 lanes of 2 flowcells.

Based on these data I would like to do SNP calling according to the best practice guide of GATK. I mapped the reads to the reference genome using TopHat2. However, when I would like to see the read group RG information, I don't get any... (samtools view -H accepted_hits.bam | grep 'RG’ source: https://software.broadinstitute.org/gatk/guide/article?id=6472). So I want to add them my self using picard’s AddOrReplaceReadGroups.

The Name of the Fasta file I obtained from our sequencing facility is:

FCH5Y5TBBXX_L4_HKRDMAIuejTADFRAAPEI-212

And the header of this file is:

@K00114:299:H5y5TBBXX:4:1101:1884:1138

Based on these information I tried to figure out what are my read group information for this sample. According to the description and definition above, I concluded that:

RGLP: ILLUMINA

RGID: H5Y5TBBXX.4 (Flowcell + lane name/number)

RGPU: ? (Flowcell + lane + sample barcode)

RGSM: samplename1

RGLB: ?

For RGPU and RGLB I am not sure how to obtain the right information?

As RGPU is defined as {FLOWCELL_BARCODE}.{LANE}_{SAMPLE_BARCODE} I am not sure where to find the barcode of the sample in the Name of my fasta or the header of my fasta file?

And second, where can I find the information for the Library identifier LB? I didn’t do the library preparation by my self. Can I get this information somewhere?

I would be really happy if you could help me solve those questions! Thank you very much in advance,

Best regards,

Jutta

From Sheila on 2016-11-15

@Jutta

Hi Jutta,

Can you confirm what you mean by “pooled”? Do you mean you don’t know which sample is which? Or, are the samples individually barcoded, and you know which reads come from each sample?

Thanks,

Sheila

From Jutta on 2016-11-16

Hi Sheila,

Sorry, I guess I mean “multiplexed”. My samples are individually barcoded, so I know which reads come from which sample.

Thanks,

Jutta

From Sheila on 2016-11-21

@Jutta

Hi Jutta,

Your PL, SM, and ID look fine. For the SM, since you know the individual samples, you will need to make sure the samples are labeled by their names.

To get the PU (GATK does not require PU), you need to look at the read names in the BAM file, not the FASTA file. Have a look under “Deriving ID and PU fields from read names” in the article.

The LB tag should group the reads that come from the same sample and were processed together in the library preparation stage. For example, in the article example, the DAD has two LBs (LIB-DAD-1 and LIB-DAD-2). Those distinguish the reads with different insert sizes.

I hope this helps.

-Sheila

From Jutta on 2016-11-22

Yes, thank you very much Sheila

From Meli on 2016-11-23

This may be a naive question, but do the ID and PU have to match parts of the header in the BAM file, or is it okay to phrases like “flowcell1.lane1.sample25” as long as you keep track of which samples are in the same flowcell/lane?

Thanks!

From shlee on 2016-11-28

@Meli, You may use your alphanumeric phrases to label the fields.

From Guillaume_D on 2016-12-08

Hi,

I read all the posts of this page and what to put in each read groups appear quite clear to me now (thanks!) but those discussions also rose some new questions to me.

Let me first try to explain my quite complicated sequencing design. I have 90 individuals, 6 individuals/lane and each individuals is sequenced in 3 different lanes (1 individuals = 1 library), so finally I have 45 lanes.

I will add the corrects read groups for each ind of each lane (so 90*3=270 RGID) during the alignment step (bwa-mem in 270 instences).

Then I have to run MarkDuplicates. If I understood correctly the good way to do it is to merge the 3 files of a same individual and run MarkDuplicates per individual (because 1 individuals = 1 library), so 90 instances of MarkDuplicates, is it right ?

After that, I want to run the BQSR and it’s still not clear to me how to do it (after reading this page and this one : http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr).

Should I run it also per individual (so 90 instances of BQSR), per lane (so 45 instances of BQSR) or per lane per individual (so 270 instances of BQSR) ?

Thanks by advance !

Guillaume

From Geraldine_VdAuwera on 2016-12-09

Hi Guillaume, you are correct on all points.

When you run MarkDuplicates on the 3 readgroups of an individual, it will write out a single bam, and you can run BQSR on that bam for that individual. So you will have 90 instances of BQSR.

From Guillaume_D on 2016-12-09

Hi Geraldine,

Great, I will do that ! Thank you very much !

Guillaume

From ibseq on 2017-01-31

Hi,

if I have an already sorted BAM file from bwa-mem, can I use it directly for markduplicates provided the validatesamfile doesnt give errors? or do I need to go through the sam file first and add the read group at that stage, sort and then mark duplicates and reconvert to bam and sort again?

my bam needs @rg, and i though to add it instead of going back to generate the sam file. will this be ok?

thanks,

ibseq

From Sheila on 2017-02-01

@ibseq

Hi ibseq,

You can run [AddOrReplaceReadGroups](https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups) on the BAM file straight from BWA. No need to convert to SAM.

-Sheila

From abor on 2017-03-02

I have two bam files from the same individual (i.e. the same individual was sequenced twice). Does it make sense to merge the 2 bam files? Does it bring any advantage in terms of coverage? Could it cause problems in the subsequent haplotypecalling and variant calling steps?

Also I wanted to ask if the following pipeline is correct: I process separately Sample1.bam and Sample2.bam (bwa —> AddOrReplaceReadGroups). Once this is done I use PrintReads to merge Sample1.recal.bam and Sample2.recal.bam, followed by AddOrReplaceReadGroups again to change the tags. Then, on Sample12merged.bam: MarkDuplicates —> Indel realignment —> BQSR. Then I go for the Haplotype calling.

Thank you very much for your help.

From Sheila on 2017-03-06

@abor

Hi,

I think [this article](https://software.broadinstitute.org/gatk/documentation/article?id=6057) will help.

-Sheila

From wisekh on 2017-04-26

Hello,

This forum was really helpful for understanding ReadGroup, but I still have some questions. Your answers will be really appreciated.

Question 1:

Among those 88 samples, 10 were re-sequenced (due to quality issues) in 1 lane of another run (i.e., in another flowcell), using the same multiplexing tags as used in the first run.

Currently, I am defining:

Would that be ok?

read group IDs are composed using the flowcell + lane name and number The PU holds three types of information, the {FLOWCELLBARCODE}.{LANE}.{SAMPLEBARCODE} Is "FLOWCELL_BARCODE" in PU different from "flowcell" in readgroup IDs? I assume both represent identical information which is "FlowCell identification". Is it correct?

Question 2:

I have two BAMs (I would like to merge) from a same sample (which means a same DNA tube I provided to a sequencing center), but they have different barcode sequences (which means they were sequenced separately or on different times). In this case, I think the two BAMs should have different LB fields (i.e. Sample.Barcode1 and Sample.Barcode2). Is this correct?

Thank you in advance,

Hoon

From Sheila on 2017-04-27

@wisekh

Hi Hoon,

1) The flowcell is the same in both PU and ID.

2) Yes, that is correct.

-Sheila

From cjaln1994 on 2017-05-29

Hi,

Thanks for all the useful discussion above. However I’m new to the field & still unsure about my particular RG information.

My situation:

I have 10 replicates of pooled, RNAseq data for each of two samples (10 x Sample A, 10 x Sample B).

By pooled data I mean each replicate has mRNA from multiple individuals all mixed together. Replicates were multiplexed across two lanes, so each replicate has a separate Lane 3 and Lane 4 bam file. I was going to use the following format:

RGID=UniqueReplicateBarcode.LaneNumber

RGLB=UniqueReplicateBarcode

RGSM=SampleA (as opposed to SampleB)

RGPL=Illumina

I am concerned about setting information correctly for the BQSR stage (I’m not marking dupes due to nature of study). I feel like I am not feeding in enough information about having .bam files from different replicates that were run on the same lane! Does BQSR care about grouping reads based on multiplexing samples into the same lanes? I also considered adding an RGPU field identical to what I have for RGID:

RGPU=UniqueReplicateBarcode.LaneNumber

Thanks in advance for any advice,

Chris

From Sheila on 2017-06-05

@cjaln1994

Hi Chris,

Sorry for the delay. I am checking with the team and will get back to you asap.

-Sheila

From Sheila on 2017-06-05

@cjaln1994

Hi Chris,

I am assuming some things here, so please correct me if I am wrong. You have pooled data for two samples. The pooled data consists of reads from 10 replicates (either biological or technical) that will be analyzed as one single sample. You do know which read comes from which replicate (barcoded reads).

If that is correct, your information looks fine except for the RGID. You should simply include the flowcell and lane, not the unique replicate barcode.

If you know which reads come from each individual, you can mark duplicates. We recommend not marking duplicates if you do not know which individual the reads come from. MarkDuplicates will take into account the read group information (specifically LB information).

BQSR works on reads that come form the same lane. The lane is reflected in the RGID, which is why you should include the flowcell and lane. Including the extra information will cause the reads from the same lane but different libraries to be processed separately. You want to process the lane data together in BQSR. Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/9437/effect-of-running-bqsr-per-sample-and-not-lane) for more information.

-Sheila

From cjaln1994 on 2017-06-06

Hi @Sheila,

Thanks so much, very helpful! I’ll go back and tweak the read group information. I have posted another question about HaplotypeCaller in a fresh discussion to keep this thread on topic :blush:

Chris

From cdomsar on 2017-10-16

Hi,

Thank you so much for the discussion above. I have read through it and found it very useful, you have a great support team.

I am experiencing a trouble with my bams, however, since I don’t know exactly how should I call my LB groups. I am mapping the raw reads (.fastq files) of the whole genome of a species into the exome of a related species. I am doing this in order to better calculate a nonsynonymous/synonymous mutation ratio in those coding sequences. I want to compare whether mapping these reads with a single-end treatment or paired-end treatment against the exome would be better in order to avoid chimeric alignments and duplicates. I have two paired-end libraries in four .fastq files. One of them has 100-bp-long and the other 250-bp-long inserts (I provide the info of the first read):

FCC0U47ACXX_7_1 (info:

FCC0U47ACXX:7:1101:1141:1978#AAGTCTCT/1) FCC0U47ACXX_7_2 (info:

FCC0U47ACXX:7:1101:1141:1978#AAGTCTCT/2)

FCC0U61ACXX_5_1 (info:

FCC0U61ACXX:5:1101:1903:2155#AAGTCTCT/1) FCC0U61ACXX_5_2 (info:

FCC0U61ACXX:5:1101:1903:2155#AAGTCTCT/2)

Mapping them with bwa as SE gives me 4 files (FCC0U47ACXX_7_1.bam, FCC0U47ACXX_7_2.bam, FCC0U61ACXX_5_1.bam and FCC0U61ACXX_5_2.bam) and 2 files as PE (FCC0U47ACXX_7.bam and FCC0U61ACXX_5.bam).

As far as I understand the example with the DAD/MOM/KID libraries, from each sample, two libraries were constructed, and each library was run in a separate lane. What if both lanes of each library were paired-end reads? Let’s say, ID:FLOWCELL1.LANE1 had forward reads, and ID:FLOWCELL1.LANE2 had reverse reads. In both cases, we are working with the same library (LIB-DAD-1) and this is why we use the same LB.

Would it make sense to add these

RG groups to my bams?

RG ID:FCC0U47ACXX_7_1 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100

RG ID:FCC0U47ACXX_7_2 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100

RG ID:FCC0U61ACXX_5_1 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

@RG ID:FCC0U61ACXX_5_2 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

Let’s say I consider that FCC0U47ACXX_7_1 and FCC0U47ACXX_7_2 belong to the same library. If I have overlapping between the forward and reverse reads, wouldn’t MarkDuplicates get rid of the second sequence since it would be considered a duplicate?

Thanks for the support and sorry if the question seems naïve, I am still on my learning curve.

best regards

Carlos

From cdomsar on 2017-10-16

In this case, I am using this command for the first library:

java -Xmx8G -jar /opt/picard-tools/picard.jar AddOrReplaceReadGroups ID=FCC0U47ACXX_7_1 PL=ILLUMINA LB=FCC0U47ACXX_7 SM=SAMPLE1 PI=100 PU=none SORT_ORDER=coordinate INPUT=FCC0U_7_1.sorted.bam OUTPUT=FCC0U_7_1.sorted.rg.bam

From Sheila on 2017-10-19

@cdomsar

Hi Carlos,

>As far as I understand the example with the DAD/MOM/KID libraries, from each sample, two libraries were constructed, and each library was run in a separate lane. What if both lanes of each library were paired-end reads? Let’s say, ID:FLOWCELL1.LANE1 had forward reads, and ID:FLOWCELL1.LANE2 had reverse reads. In both cases, we are working with the same library (LIB-DAD-1) and this is why we use the same LB.

Yes, the library preparation being the same for those reads means the LB field is the same. The different lanes are reflected in the ID field.

>Would it make sense to add these

RG groups to my bams?

RG ID:FCC0U47ACXX_7_1 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100

RG ID:FCC0U47ACXX_7_2 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100

RG ID:FCC0U61ACXX_5_1 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

@RG ID:FCC0U61ACXX_5_2 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

Those look correct to me :smile:

>Let’s say I consider that FCC0U47ACXX_7_1 and FCC0U47ACXX_7_2 belong to the same library. If I have overlapping between the forward and reverse reads, wouldn’t MarkDuplicates get rid of the second sequence since it would be considered a duplicate?

I don’t think MarkDuplicates will remove those reads because it takes into account the start position. Those reads would not have the same start position. Have a look at the MarkDuplicates slide deck in the presentations section for more information.

-Sheila

From Erru on 2017-12-08

Hi, I'm having problems with the platform. I have new data from BGISEQ sequencing machine and the program gave an error with that so I changed to 'UNKNOWN' as it was given as a valid option in the error message. The problem is unknown is also giving the error and validateSam gives:

HISTOGRAM java.lang.String

Error Type Count ERROR:INVALIDPLATFORMVALUE 1

So is the unknown a valid options or do I have to claim the data to be something it really isn't to get this working?

My error from BQSR is: ERROR MESSAGE: The platform (UNKNOWN) associated with read group GATKSAMReadGroupRecord @RG:483 is not a recognized platform. Allowable options are ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS,UNKNOWN

and my GATK version: The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18

From Sheila on 2017-12-14

@Erru

Hi,

Hmm. Have you tried using a different value other than UNKNOWN? Does that still throw an error? Can you post an entire read group record here?

Thanks,

Sheila

From Erru on 2017-12-15

ILLUMINA as platform has worked well before, others I haven’t tested.

I had to change the read groups with picard due to using BGISEQ there when mapping the read. My picard command was for each sample like:

picard AddOrReplaceReadGroups I=dedup_318.R1.sorted.bam O=dedup_318.sorted.bam RGID=318 RGLB=318 RGPL=UNKNOWN RGPU=unit1 RGSM=318

(this is how you use picard in my current computing environment )

Unfortunately BGISEQ doesn’t seem to give lane or other info in the fastq files, that’s why ID, sample and LB are all the same value.

From Sheila on 2017-12-18

@Erru

Hi,

I will have to check with the team why UNKNOWN does not work. In the meantime, I think you can stick to using ILLUMINA as PL even though it is not true. As long as the PL field is the same for the read groups, you should be fine.

-Sheila

From shlee on 2017-12-18

Hi @Erru,

I suspect the UNKNOWN option is some artifact of autogenerating the tool documents from the code. It is something the tool’s code uses as a value when a field’s value does not match any expected value but not a true option available to you. Please use a different value and let us know how it goes.

From danfulop on 2017-12-22

I thought from Erru’s message above from December 8th that “COMPLETE” is allowed for RGPL (read group platform), but `Picard ValidateSamFile` fails to recognize it. I have Complete Genomics data and would like to use an appropriate platform name.

Is that a discrepancy between Picard and GATK? If so, can I ignore the error from `Picard ValidateSamFile` and assume that GATK is okay with COMPLETE as the RGPL?

From shlee on 2017-12-22

Hi @danfulop,

Can you tell us which version of Picard is giving you the error with COMPLETE? And can you clarify whether you’ve tried using it in the PL field with GATK? Thanks.

From sutturka on 2018-01-22

Hi,

I have read through all the comments and indeed they are very helpful to define the correct @RG for my data. I think I know how to correctly define it but I want to get an expert opinion.

Experiment:

I have several tumor samples sequenced on Novaseq. For each sample I get four pairs of fastq files corresponding to four lanes. In the attached image I have described the file name, read names and RG tags derived from read-names. Can you please help to verify these? Based on your recommendation, I am planning to write a script to automatically fetch RG information from the read names.

![](https://us.v-cdn.net/5019796/uploads/editor/zy/08d84asjohxn.png “”)

I have following questions:

1. As per my understanding @RG-ID should be unique to fastq files from each lane of each run. I have assigned combination of (flowcell.barcode.lane) which is unique for each run/lane/sample. Is this assignment correct?

2. The RG-PU field is optional but it gets preference in BQSR. Since, I am assigning same value to ID and PU, can I skip assigning the PU? If it is skipped, will BQSR and other tools will work correctly based on current RG-ID?

3. In initial run, samples were using a primary library. But because of some good reasons the remainder of the same library was subject to a number of amplifications. In later runs, amplified library was sequenced. In such case, should I use the same or different library ID for the amplified runs?

4. For two samples, we have received lower than expected coverage and we will perform additional sequencing by preparing new library. For this data, I will use same @RG-ID convention derived from read_IDs, add different library ID and same sample ID. Is this correct?

Thank you and I appreciate your time for answering this questions.

From sutturka on 2018-01-24

In above question, I have confusion because for each samples one library is loaded on multiple runs and each run have 4 lanes of data. Therefore, should I add:

Separate RGID to each run/each lane (i.e flowcell.barcode.lane combination as shown in image above)

Separate RGID only to each lane because same library is sequenced across multiple runs (i.e combination of barcode.lane)

From Sheila on 2018-01-24

@sutturka

Hi,

1/2)Your read names look fine. Note, the PU is not required, so if you want, you can remove the barcode from the ID and leave the PU out entirely. Your way is correct too.

3) You should use a different library for the newly amplified library.

4) That is correct.

-Sheila

From sutturka on 2018-01-25

Note, the PU is not required, so if you want, you can remove the barcode from the ID and leave the PU out entirely. Your way is correct too.

I think barcode is necessary here. If I exclude the barcode, then RG-ID for the tumor_1 and tumor_98 samples becomes identical. So even with using RG-ID as (flowcell.barcode.lane), I can still exclude the PU without having any impact on BQSR?

From Sheila on 2018-01-26

@sutturka

Hi,

I think the SM tag will distinguish between the two samples even if the RGID is the same. However, it may be more convenient to keep what you have (including the barcode).

Yes, you can exclude the PU field, as GATK does not require it. But, if you keep it, PU will take precedence in BQSR (without it BQSR will take into account ID).

-Sheila

From freek on 2018-05-08

Hi GATK team,

Thank you for this page, great and insightful discussion.

After reading everything though I still have a question. In my lab we use an Illumina NextSeq. When loading the library, one puts the entire sample onto a single chip, the sequencer then fills all 4 lanes of this chip with the same library. Thus, after de-multiplexing one can have a fastq file from 1 sample that comes from multiple lanes. The fastq snippets below are from the same fastq file:

```

@NB501997:54:HTFJ3BGX3:1:11101:2980:1037 2:N:0:GGCTAC

@NB501997:54:HTFJ3BGX3:4:23612:4923:20392 2:N:0:GGCTAC

```

You see that the lanes number differs. This means I cannot add the ID to the read group information when mapping using BWA (bwa mem -M -R $RG …), as it adds 1 single ID per output bam file. So my question is, how do I add this information of lane per read back to the bam file? Or, perhaps, I should treat a NextSeq chip as a single lane and simply use the chip ID as the read group ID because the lanes are connected anyway?

Highest regards,

Freek van Hemert.

From Sheila on 2018-05-10

@freek

Hi Freek,

>how do I add this information of lane per read back to the bam file?

The ID captures the lane information. I think you are asking how to change the read group information? If so, you can use [AddOrReplaceReadGroups](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_AddOrReplaceReadGroups.php).

-Sheila

From freek on 2018-05-14

Hi @Sheile,

Thank you for your response. I was talking about the ID in the Read Group information. I am aware of AddOrReplaceReadGroups, but that program still adds one Read Group line per BAM file and assumes one sample per lane (lane being the smallest unit of technical variance.) But our NextSeq instrument has 4 connected lanes so it produces up to 16 fastq file pairs from all those four lanes combined. I then map them into a BAM file which has the Read Group ID as per your examples: “Chip.lane”. However that would mean every single read in the bam would need it’s own Read Group because reads from the same sample come from different lanes. The headers from my post above are from one single fastq file and one single sample and one single chip, but from multiple lanes… How would the Read Group information look like for such a sample? I feel the best thing to do is leave out the lanes? Or, I should split the lanes into different BAM files (somehow)?

From shlee on 2018-05-14

Hi @freek,

You have a sample sequenced on four different sequencer lanes that were then melded together into the same FASTQ file.

Lane level information is important to preserve for base quality score recalibration (BQSR), a step in the Preprocessing Best Practices.

I believe in our pipelines, lane-level data is separated in to different files. These are processed by [IlluminaBasecallsToSam](https://broadinstitute.github.io/picard/command-line-overview.html#IlluminaBasecallsToSam), which labels each file with a unique read group. Given your data is already in FASTQ format, the equivalent for you is to run [FastqToSam](https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam). However, so far as I know, this tool expects each file to represent one read group. So you will have to separate out your FASTQ reads into four lane-level files. This should be fairly straight-forward given what I’ve outlined in the Deriving ID and PU fields from read names section of the above document. I had added this section to update the document when I first started with GATK.

AFAIK, GATK and Picard cannot perform such a demultiplexing of reads. However, perhaps you can search BioStars or SeqAnswers for other programs that can. Alternatively, you could use `grep ‘@NB501997:54:HTFJ3BGX3:1:’` and `grep @NB501997:54:HTFJ3BGX3:4:` to sort the reads with Unix. I hope this is helpful.

From freek on 2018-05-14

Hi @shlee,

Thank you this is a very useful answer and the one I feared a bit :smile:

There is still one thing I want to make very explicit though before I’m going to change our pipeline this much and so early in the process (possibly even the base calling itself which we do now with Illumina’s own bcl2fastq).

I have the feeling that at broad you usually use sequencing machines with a much higher throughput than our NextSeq500. For example the HiSeq system, which takes a chip with 6 separately loadable lanes. The NextSeq500 however takes smaller chips with 4 lanes, these lanes cannot be loaded separately, the lanes are effectively 1 chamber, filled with 1 library. knowing this, you would still split the reads by lane before doing the base recalibration steps?

Highest regards,

Freek

From shlee on 2018-05-14

@freek, given

> The NextSeq500 however takes smaller chips with 4 lanes, these lanes cannot be loaded separately, the lanes are effectively 1 chamber, filled with 1 library.

then it seems to me that this chip is effectively a single lane, subject to the same sequencing chemistry. Are the tile coordinates unique to each of the lanes or reused per lane? This is an important consideration. If the former, then I think you could consider treating the entire chip as one lane. If the latter, then I think you need to treat the lanes as separate read groups. I would have to check if GATK/Picard tool features that use tile coordinates (BQSR and MarkDuplicates) take only the RG/PU and read name tile coordinates or if the lane information of the read name is also a consideration. If the former, then definitely you must treat the lanes as separate read groups. If the latter, then you could possibly get away with the lanes being one read group.

I have to say, I think you miss out on leveraging the advantages of true multiplexing. For your NextSeq500, this would, for example, look like barcoded library A mixed with barcoded library B run on two different 4-chamber chips resulting in two different runs.

From freek on 2018-05-29

@shlee, Thank you for your answer, it is now clear. I will analyze per lane, just to be sure, and merge the resulting bam files later on. My experiment now looks like: (4 samples, 4 lanes)

bwa mem per lane per sample (make 16 bams)
Then sort the bams
Merge the bams using markDuplicates
Sort the bams again (this is needed right?)
Run BQSR on the merged bam (which contains reads from a single sample over multiple lanes now but with correct Read Groups so BQSR will be performed correctly)
Initial variant calling,

Highest regards.

From shlee on 2018-05-29

Hi @freek,

You may find [this somewhat outdated reference implementation](https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline) useful. In the illustrated pipeline, you see that newer versions of MarkDuplicates takes in query-name sorted alignments so there is no need for you to sort your BAMs between BWA MEM and MarkDuplicates. After MarkDuplicates merging, sort and fix the tags. The tag fixing tool is now [SetNmMdAndUqTags](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_SetNmMdAndUqTags.php). You can find updated implementations of the preprocessing pipelines at https://github.com/gatk-workflows/.

From freek on 2018-05-30

Hi @shlee,

The new wdl seem better readable to me and it fixes some t arguments that weren’t working in my shell script. I’m still wonder though about the wgs_coverage_interval_list which I don’t see defined anywhere (some other files I also can’t find), I have asked about this [here](https://gatkforums.broadinstitute.org/gatk/discussion/12039/where-is-known-indels-sites-vcfs-defined) as well but didn’t get a response yet. Can you help me?

Freek.

From freek on 2018-05-30

@shlee I did as you adviced and skip sorting after mapping to immdediatly go for merging using MarkDuplicates but I do get an error:

java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.aligned.unsorted.duplicates_marked.bam. Sort order is queryname. Offending records are at [NB501997:16:H37NMBGX3:1:11101:8782:1041] and [NB501997:16:H37NMBGX3:1:11101:25283:1042]

My command is:

echo -e "### GATK MarkDuplicates on BAM file\n" >> \$log echo -e "This step also merges the lanes specific BAM files into a single BAM file\n" >> \$log started gatk MarkDuplicates \\ \$markDuplicatesInput \\ --ASSUME_SORT_ORDER queryname \\ -O \${bamBaseName}.aligned.unsorted.duplicates_marked.bam \\ -M \${bamBaseName}.marked_dup_metrics.txt finished

where $markDuplicatesInput is a list:

--INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L1.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L2.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L3.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L4.unsorted.bam

I will sort them for now, that means I can skip SetNmMdAndUqTags them? I will re-sort after the merge.

Highest regards,

Freek.

From shlee on 2018-05-30

Hi @freek,

> Alignments added out of order…Sort order is queryname. Offending records are at [NB501997:16:H37NMBGX3:1:11101:8782:1041] and [NB501997:16:H37NMBGX3:1:11101:25283:1042]

It appears that your inputs are not queryname-sorted in a manner that MarkDuplicates expects. When you ran your alignment, did you provide BWA queryname sorted FASTQs? BWA outputs exactly in the same order as it reads in reads. It sounds like you found a solution that works. If you are interested in revisiting this error, perhaps try single inputs to MarkDuplicates.

From your other thread that you link, it looks like you are finding the cloud resources on your own with some digging. I see:

```

gsutil ls gs://broad-references/hg38/v0/wgs*

gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list

gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list

gs://broad-references/hg38/v0/wgs_coverage_regions.hg38.interval_list

gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list

gs://broad-references/hg38/v0/wgs_metrics_intervals.interval_list

```

And so I suspect one of these corresponds to the `wgs_coverage_interval_list` you refer to but has been renamed. Just be sure to match exactly the usage found in the pipelined scripts and JSON inputs. If in doubt, be sure to look inside the interval_list to make sure you have what you expect.

From freek on 2018-06-01

Hi @shlee

I answered the last remark in the [other thread](https://gatkforums.broadinstitute.org/gatk/discussion/11186/outlook-on-grch38-hg38-for-in-exome-and-other-targeted-sequencing#latest) to stay on topic :smile:

Freek.

From CarolineJudy on 2018-06-21

Hi @shlee ,

Thank you for this helpful discussion. After reading the above questions and responses, I still have two questions about defining read groups.

1). How do you define ‘lane name’, which is part of the definition given for ID:

“read group IDs are composed using the flowcell + ‘lane name’ and number, making them a globally unique identifier across all sequencing data in the world.”

In Illumina data, I don’t see ‘lane name’ defined, just the lane number is defined:

@‘instrument’:‘run number’:‘flow cell id’:‘lane number’:‘tile’:‘x_coord’:‘y_coord’:‘read’:‘isfiltered’:‘control number’:‘sample number’

What do you recommend using for ‘lane name’?

2. Based on the thread discussion, I think I understand that for cases of multiplexing where uniquely-barcoded libraries (corresponding to unique biological samples) are sequenced together in a single lane to save cost, all libraries would belong to the same read group and thus share the same RGID.

HOWEVER, GATK staff recommends defining a separate read group for each biological sample for purposes of downstream processes.

In my case, I have 8 paired-end cDNA libraries representing 8 unique biological samples (one library per biological sample) multiplexed in a single Illumina lane.

Here is an example read name:

Read1 (F/R), Individual 1

J00160:59:HG5Y5BBXX:8:1101:26058:1244 1:N:0:ACATTGGC

J00160:59:HG5Y5BBXX:8:1101:26058:1244 2:N:0:ACATTGGC

Read1(F/R), Individual 2

J00160:59:HG5Y5BBXX:8:1101:26118:1244 1:N:0:ATGCCTAA

J00160:59:HG5Y5BBXX:8:1101:26118:1244 2:N:0:ATGCCTAA

With some uncertainty, since I don’t understand what ‘lane name’ means (see question one), I would define ID as ‘HG5Y5BBXX8’, which is ‘flow cel id’‘lane number’ for both individuals following the original article’s definition.

However, the two individual would have different PUs:

Individual 1

HG5Y5BBXX8ACATTGGC

Individual 2

HG5Y5BBXX8ATGCCTAA

BUT, based on your recommendations in the above discussion, you would suggest my differentiating the IDs, also. Because of this recommendation, should I just use the PU (above) for the ID field, too?

In that case:

Individual 1

ID: HG5Y5BBXX8ACATTGGC

PU: HG5Y5BBXX8ACATTGGC

Individual 2

ID: HG5Y5BBXX8ATGCCTAA

PU: HG5Y5BBXX8ATGCCTAA

Thank you very much for any assistance you can offer.

CarolineJudy

From shlee on 2018-06-22

Absolutely @CarolineJudy, you can use the unique PUs for your RGIDs.

From CarolineJudy on 2018-06-29

Thank you!

From herb on 2018-10-08

Hi @shlee,

I have a question about read groups.

I used “bismark” to map my raw reads to to hg38 genome. After that I used random barcode to remove duplications and sorted the sam file.

Here is my bam file header(part of them):

HD VN:1.0 SO:coordinate

SQ SN:chr1 LN:248956422

SQ SN:chr2 LN:242193529

SQ SN:chr3 LN:198295559

SQ SN:chr4 LN:190214555

SQ SN:chr5 LN:181538259

…

@PG ID:Bismark VN:v0.15.0 CL:“bismark -bowtie2

I can not see my read groups in my bam file header (start with @RG). So when I using this scrips:

gatk BaseRecalibrator —java-options “-Xmx12G -Djava.io.tmpdir=./”\ -R ~/ref/hg38/hg38_only_chromsomes.fa \ -I ~/5.bismark/test.rm.pair.sort.bam \ -O recal_data.table \ —known-sites ./dbsnp_146.hg38.vcf.gz \ —known-sites ./1000G_phase1.snps.high_confidence.hg38.vcf.gz\ —bqsr-baq-gap-open-penalty 30

I do get an error:

“A USER ERROR has occurred: Number of read groups must be >= 1, but is 0”

I do not know which steps can generate reads groups in sam tag “@RG” ? or if I do not have @RG, I can use “ BaseRecalibrator” to generate “recal.data.table” file ?

Highest regards,

Herb

From XiaoshenYin on 2018-11-19

Hello,

When I proceed with the gatk pipeline for calling variants from RNAseq (GATK | Doc #3891 | Calling variants in RNAseq – Broad Institute). I am not going to add read groups, sort, mark duplicates, and create index using commands below:

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

However, when I check my fastq files using both head and tail, I got information below:

head -n 4 400_R1.fastq

@A00218:8:H2NVCDMXX:1:1101:2446:1000 1:N:0:AATGCCTC+CTAGCGCT

CCGTCTTCTTCTGCTTGGTCTCCCCAGAGCCGTGGCACAGCGTGCACGGGTTCGTCATGATGGTGCCCTTGCCGGAGCACCGGCGGCACGTGGACCTCATCACAAAGGGCCCTGTGTTTATCGTTTCCATGCCCGTGCCGTTGCAGTAGTG

FFFFFFFFFFFFFFFFFFFFFFFFFFF8FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF—FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFF8FFF-FFF-FF8F-F

tail -n 4 400_R1.fastq

@A00218:8:H2NVCDMXX:2:2488:23755:37059 1:N:0:AATGCCTC+CTAGCGCT

CCTCAAGCTTCCTCTTTGTGTTGATCAAGCTGGTATTCTGGGAATGCAGCAGCTGCACCCTCTCGGTGGCGCCCAGCAACTCCTGCTCGGCCACCTTCCTGGCCCTCTCCGTTTGCTCCAGGGCGGTGCGTATCTCCTCCACCTCGGCCAG

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFF8FFFFFFFFFF8FFFFFF8

If this is the case, it means that one library is actually sequenced on two lanes. In this case, when I convert one sam file (for one sample and also one library actually) into a bam file, how should I set PU (i.e. should I pick 1 or 2 as flow cell lane) or does it not matter? What effects would it have if I randomly set a PU value?

PS: In my experimental design, each sample is one library, so sample=library.

Thanks.

From shlee on 2018-11-20

Hi @XiaoshenYin,

It appears that your single FASTQ contains reads from multiple lanes. If you wish to follow the GATK Best Practices, especially for the purposes of BQSR, read group must differentiate lanes. Based on what you say:

> I am not going to add read groups, sort, mark duplicates, and create index using commands below

If you will not perform BQSR and are uninterested in lane-level duplicate metrics (sequencing centers are mostly interested in this), then you could presumably assign your data to a single read group. GATK tools require every read be assigned a read group.

If you will split your data by lane, as far as I know, you will have to do this manually, e.g. see [here](https://www.biostars.org/p/78079/), or try to find an external tool that will do this for you. The simplest approach would be to grep.

P.S. It may be easier to process if you convert your FASTQ to an unaligned BAM, e.g. with [FastqToSam](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_FastqToSam.php). Then, if you sort by queryname, your different readgroups would be in blocks.

From XiaoshenYin on 2018-11-20

Hi @shlee ,

Sorry, I meant

I am NOW going to add read groups, sort, mark duplicates, and create index using commands below (not “not”, which was a typo). I want to add read groups, sort, mark duplicates, and create index using those commands. I already got the sam files for fastq files with reads from multiple lanes and I want to move on to the commands I show above. In this case, do I have to go back to fastq files, split reads from different lanes and redo the STAR alignment process and then add read groups, sort, mark duplicates and create index …

If I do not want to go back to fastq files and start over, are there any remedies?

If I set the PU to a consistent value/string no matter from which lane a read is from, will this affect the downstream analyses and final results a lot? What effects would it have?

Thanks.

From shlee on 2018-11-20

Hi @XiaoshenYin,

If your data is already in SAM format, then that’s one step forward. I assume the data is aligned based on:

> redo the STAR alignment process

So you can still assign read groups to your aligned reads with AddOrReplaceReadGroups. Whatever file you provide, the tool assigns the same read group to all the data. So if you want to assign different read groups, you will have to split the data up into separate files. There may be an external tool out there that allows you to do this programmatically, instead of manually. You can also ask for this to be a new feature in AddOrReplaceReadGroups, by posting to the [Picard](https://github.com/broadinstitute/picard/issues) repository.

PU takes precedence over ID for base recalibration if it is present. Other pre-processing steps do not take it into consideration.

From XiaoshenYin on 2018-11-21

Hi @shlee,

Do I have to split fastq files and redo the STAR alignment or can I simply split sam files? Are there any tools you recommend?

Thanks.

From splaisan on 2018-11-27

Sorry to wake this thread up again but I am confused by apparently discordant advices.

I am starting a large variant analysis from 30+ samples processed together.

It is Hiseq4000 data from a single run which has been demultiplexed so I lost record of the barcodes but I can differentiate the samples from the fastq file names being processed.

All my reads have the same name like ‘ @K00335:321:H2L5FBBXY:**1**:1101:2493:1455’ and comes from 4 lanes run from a fully-pooled library (here lane 1).

which I read: device=K00335 flowcell=H2L5FBBXY lane=1 .. coordinates

what is ‘321’ there? part of the flowcell name? it differs between runs

Did I make a valid RG field with this to build my @RG

ID=${flowcell}.${lane}.$(smp} PU= ??? <- same as ID? LB=lib_${smp} PL=ILLUMINA SM=${smp}

The uniqueness requirements for ID and PU are not clear to me

Thanks for your comments and sorry to ask again

Stephane

From splaisan on 2018-11-28

I found what 321 is => run number on that device

a Hiseq4000 (likely any modern Illumina) read name is now:

```

@:::::: :::

```

[reference](https://help.basespace.illumina.com/articles/descriptive/fastq-files/)

one step closer to a full answer :-)

From lidia on 2019-03-29

Hi GATK team!

I am writing because I am not entirely sure of how to define my Read Groups. I have read the documentation but I have some doubts.

My situation is the following: I have 160 DNA samples (each sample is of a different individual) and were pair-end sequenced, which means I have 320 fastq files in total. The headers of my fastq files look like this for an example of 3 samples (forward and reverse):

Fish1_R1:

A00155:120:HG5F2DSXX:4:1101:1723:1000 1:N:0:GACCTGAA+TTGGTGAG Fish1_R2:

A00155:120:HG5F2DSXX:4:1101:1723:1000 2:N:0:GACCTGAA+TTGGTGAG

Fish2_R1:

A00155:121:HG37LDSXX:3:1101:1687:1000 1:N:0:GACGTCTT+GGTTCACC Fish2_R2:

A00155:121:HG37LDSXX:3:1101:1687:1000 2:N:0:GACGTCTT+GGTTCACC

Fish3_R1:

A00155:129:HG55FDSXX:2:1101:1443:1078 1:N:0:TCATCCTT+AGCGAGCT Fish3_R2:

A00155:129:HG55FDSXX:2:1101:1443:1078 2:N:0:TCATCCTT+AGCGAGCT

The headers use the following name scheme:

@ : : : : : : : : :

Considering this, I created the following read groups:

Fish1_R1:

RG ID: HG5F2DSXX.4_Fish1_R1 SM: Fish1_R1 PL: illumina LB: HG5F2DSXX.4 PU: A00155 Fish1_R2:

RG ID: HG5F2DSXX.4_Fish1_R2 SM: Fish1_R2 PL: illumina LB: HG5F2DSXX.4 PU: A00155

Fish2_R1:

RG ID: HG37LDSXX.3_Fish2_R1 SM: Fish2_R1 PL: illumina LB: HG37LDSXX.3 PU: A00155 Fish2_R2:

RG ID: HG37LDSXX.3_Fish2_R2 SM: Fish2_R2 PL: illumina LB: HG37LDSXX.3 PU: A00155

Fish3_R1:

RG ID: HG55FDSXX.2_Fish3_R1 SM: Fish3_R1 PL: illumina LB: HG55FDSXX.2 PU: A00155 Fish3_R2:

RG ID: HG55FDSXX.2_Fish3_R2 SM: Fish3_R2 PL: illumina LB: HG55FDSXX.2 PU: A00155

Will this read groups be alright?

Is okay if as PU I use the information of the instrument?

Thanks a lot for your help!

Lidia

From lidia on 2019-04-01

Oh! sorry, the headers seem not to appear as they should. I send again the corrected message…

Hi GATK team!

I am writing because I am not entirely sure of how to define my Read Groups. I have read the documentation but I have some doubts.

Fish1_R1:

A00155:120:HG5F2DSXX:4:1101:1723:1000 1:N:0:GACCTGAA+TTGGTGAG Fish1_R2:

A00155:120:HG5F2DSXX:4:1101:1723:1000 2:N:0:GACCTGAA+TTGGTGAG

Fish2_R1:

A00155:121:HG37LDSXX:3:1101:1687:1000 1:N:0:GACGTCTT+GGTTCACC Fish2_R2:

A00155:121:HG37LDSXX:3:1101:1687:1000 2:N:0:GACGTCTT+GGTTCACC

Fish3_R1:

A00155:129:HG55FDSXX:2:1101:1443:1078 1:N:0:TCATCCTT+AGCGAGCT Fish3_R2:

A00155:129:HG55FDSXX:2:1101:1443:1078 2:N:0:TCATCCTT+AGCGAGCT

The headers use the following name scheme:

@ < instrument > : < run number > : < flowcell ID > : < lane > : < tile > : < x-pos > : < y-pos > < read > : < is filtered > : < control number > : < barcode sequence >

Considering this, I created the following read groups:

Fish1_R1:

RG ID: HG5F2DSXX.4_Fish1_R1 SM: Fish1_R1 PL: illumina LB: HG5F2DSXX.4 PU: A00155 Fish1_R2:

RG ID: HG5F2DSXX.4_Fish1_R2 SM: Fish1_R2 PL: illumina LB: HG5F2DSXX.4 PU: A00155

Fish2_R1:

RG ID: HG37LDSXX.3_Fish2_R1 SM: Fish2_R1 PL: illumina LB: HG37LDSXX.3 PU: A00155 Fish2_R2:

RG ID: HG37LDSXX.3_Fish2_R2 SM: Fish2_R2 PL: illumina LB: HG37LDSXX.3 PU: A00155

Fish3_R1:

RG ID: HG55FDSXX.2_Fish3_R1 SM: Fish3_R1 PL: illumina LB: HG55FDSXX.2 PU: A00155 Fish3_R2:

RG ID: HG55FDSXX.2_Fish3_R2 SM: Fish3_R2 PL: illumina LB: HG55FDSXX.2 PU: A00155

Will this read groups be alright?

Is okay if as PU I use the information of the instrument?

Thanks a lot for your help!

Lidia

From lidia on 2019-04-05

Hi again! I am so sorry. I think I had the structure misunderstood and the 2 reads (forward and reverse) should have the same Read Group. Is this right? Therefore the read groups should look like this:

Fish1:

RG ID: HG5F2DSXX.4_Fish1 SM: Fish1 PL: illumina LB: HG5F2DSXX.4 PU: A00155 Fish2:

RG ID: HG37LDSXX.3_Fish2 SM: Fish2 PL: illumina LB: HG37LDSXX.3 PU: A00155

Fish3: @RG ID: HG55FDSXX.2_Fish3 SM: Fish3 PL: illumina LB: HG55FDSXX.2 PU: A00155

Finally my command line with bwa for pair-end reads would look like this:

bwa mem -M -R ‘@RG ID: HG5F2DSXX.4_Fish1 SM: Fish1 PL: illumina LB: HG5F2DSXX.4 PU: A00155’ ReferenceGenome/sl.fa RawData/Fish1_R1.fastq.gz RawData/Fish1_2.fastq.gz > AlignedReads/Fish1_aligned_reads.sam

Is it correct?

I will very much appreciate your help

Lidia

From othomps101 on 2019-05-06

Hi @Geraldine_VdAuwera

I have 3 different yeast strains (1 parent and 2 evolved isolates) sequenced individually via pacbio and illumina (300bp PE) which resulted in 3 fastq files per strain (RSII, R1, R2). I have two bam files per strain, one with pacbio aligned reads and one with illumina R1+R2 aligned reads. Since each strain only had one library per platform, and given the head and tail read names for each file, does it makes sense to use:

m(date_time)_instrument#_SMRTcellbarcode_set#_part#/ZMW#/(subreadregion start_stop) > StrainA_RSII_filtered_subreads.fastq <

m170920_170309_42154_c101239312550000001823289412091701_s1_p0/108988/0_10862

m170920_170309_42154_c101239312550000001823289412091701_s1_p0/108987/14264_21198 > StrainB_RSII_filtered_subreads.fastq <

m170921_054339_42154_c101239312550000001823289412091703_s1_p0/108989/0_10812

m170921_054339_42154_c101239312550000001823289412091703_s1_p0/108987/0_4749 > StrainC_RSII_filtered_subreads.fastq <

m170921_182426_42154_c101239312550000001823289412091705_s1_p0/108988/7096_9738

@m170921_182426_42154_c101239312550000001823289412091705_s1_p0/108987/56857_72226

RGSM=strain_name

RGPL= pacbio

RGLB=strain_name.pacbio

RGID=strain_name.pacbio

RGPU=smrtcellbarcode.strain

instrument:run#:flowcellID:lane:tile:xpos:ypos read1or2:filtered?:control#:sample# > StrainA_S1_L001_R1_001.fastq <

M02780:209:000000000-B69JD:1:1101:10019:1252 1:N:0:NGACCA

M02780:209:000000000-B69JD:1:1101:10019:1252 2:N:0:NGACCA

M02780:209:000000000-B69JD:1:2119:15110:25287 1:N:0:TGACCA

M02780:209:000000000-B69JD:1:2119:15110:25287 2:N:0:TGACCA > StrainB_S2_L001_R1_001.fastq <

M02780:209:000000000-B69JD:1:1101:9381:1251 1:N:0:NCCAAT

M02780:209:000000000-B69JD:1:1101:9381:1251 2:N:0:NCCAAT

M02780:209:000000000-B69JD:1:2119:11733:25286 1:N:0:GCCAAT

M02780:209:000000000-B69JD:1:2119:11733:25286 2:N:0:GCCAAT > StrainC_S3_L001_R1_001.fastq <

M02780:209:000000000-B69JD:1:1101:8914:1253 1:N:0:NTTGTA

M02780:209:000000000-B69JD:1:1101:8914:1253 2:N:0:NTTGTA

M02780:209:000000000-B69JD:1:2119:21190:25287 1:N:0:CTTGTA

@M02780:209:000000000-B69JD:1:2119:21190:25287 2:N:0:CTTGTA

RGSM=strain_name

RGPL= illumina

RGLB=strain_name

RGID=strain_name.illumina

RGPU=flowcellid.lane.strain (even though flowcellid.lane is the same)

From AISHAYOUSAF on 2019-06-19

Hi,

I am having trouble assigning RGIDs to following sample. These are 4 different runs belonging to 1 sample which are sequenced on single lane (3). However, instrument ID and run is different for each. In this case can I assign same RGID to them or I should assign them unique RGIDs when aligning with bwa-mem?

SRR740787.1.1 HWI-ST539_51:3:1:1491:2031 length=100

SRR740790.1.1 HWI-ST457_154:3:1:1105:1994 length=100

SRR740792.1.1 HWI-ST459_72:3:1:1592:1726 length=100

SRR740793.1.1 HWI-ST457_153:3:1:1356:1922 length=100

Looking for help in this regard! Thanks!

Report abuse