created by GATK_Team
on 2017-12-24
Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.
However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).
At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done conveniently at the same time that we do the duplicate marking, by running Mark Duplicates on all read group BAM files for a sample at the same time. Then we run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below for more details.
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:
These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleAlib1lane1.fq leading to read group A11, sampleAlib2lane1.fq leading to read group A21 and so on).
Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
The example data becomes:
Note that we used to do a first round of marking duplicates here for QC purposes but tool improvements have rendered this obsolete.
Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, bur for the sake of efficiency we combine this with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).
The example data becomes:
This process of marking duplicates eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
Then all that's left to run is base recalibration (BQSR). We used to have another step here called indel realignment, but assuming you're using a modern caller like HaplotypeCaller or Mutect2, you don't need to do it. You would only do so if you were using a locus-based variant caller like MuTect 1 or UnifiedGenotyper, but why would you want to do that? (You really don't)
The example data becomes:
Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).
From kami_king on 2018-09-10
Excume me, I still have a question about merged_lane bam file snv calling part, how to config Haplotypercaller command to avoid different lanes separate, or the vcf file avoid different lanes, only the site depth information? Thank you very much.
From Sheila on 2018-09-14
@kami_king
Hi,
Have a look at the [Read Group entry](https://software.broadinstitute.org/gatk/documentation/article.php?id=6472). As long as your SM tag is identified properly, you will be fine.
-Sheila
From btkoay on 2019-09-06
Dear GATK team,
I have a question for this tutorial. In my WES sample, I found out that the reads are from 3 separate sequence identifier; i.e.
FlowcellA:lane1
FlowcellA:lane2
FlowcellB:lane5
They were prepared from one library, but somehow, the service company loaded them this way during sequencing run.
Regarding the statement below, I am wondering if it is advisable for me to separate the reads into 3 separate fastq files. Then, add the respective read groups information for all 3 fastq file. Merge them back into one fastq file, and then only start with the pre-processing workflow?
This way, I would only do mapping and sorting once, as compared to 3 times for each sample.
“Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation);”
Thank you,
Koay