created by Geraldine_VdAuwera
on 2013-08-02
There are four major organizational units for next-generation DNA sequencing processes that we use throughout the GATK documentation:
- Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of a high-throughput sequencing machine. For Illumina machines, this is the physical sequencing lane.
- Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots of the same library. The DNA library, together with its preparation, is the natural unit being sequenced. For example, if the library has limited complexity, many sequences will be duplicated, resulting in a high duplication rate across lanes. If working with RNAseq, the library preparation process involves reverse transcription into cDNA.
- Sample: A biological sample coming from a single individual. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.
- Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g. ESP with 800 EOMI samples), we divide the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.
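The four units above nest inside each other, which can be sketched as a small data model. This is only an illustration: the class names, field names, and example identifiers below are hypothetical and are not part of GATK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lane:
    # One independent sequencing run unit, e.g. one Illumina flowcell lane.
    flowcell: str
    number: int


@dataclass
class Library:
    # One DNA preparation; aliquots of it may be run on several lanes.
    name: str
    lanes: List[Lane] = field(default_factory=list)


@dataclass
class Sample:
    # Treated as one independent "individual" for analysis purposes.
    name: str
    libraries: List[Library] = field(default_factory=list)


@dataclass
class Cohort:
    # Samples analyzed together; membership depends on the project design.
    name: str
    samples: List[Sample] = field(default_factory=list)


# Tumor and normal from the same patient are modeled as two distinct samples.
normal = Sample("patient1_normal", [Library("lib_N", [Lane("FC1", 1), Lane("FC1", 2)])])
tumor = Sample("patient1_tumor", [Library("lib_T", [Lane("FC1", 3)])])
cohort = Cohort("example_cohort", [normal, tumor])

# Count every lane that contributes data to the cohort.
lane_count = sum(len(lib.lanes) for s in cohort.samples for lib in s.libraries)
print(lane_count)
```

In this sketch one library is spread over two lanes, which is the situation where duplicates from a low-complexity library would show up across lanes.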
Note that many GATK commands can be run at the lane level, but will give better results when they see all of the data for a single sample, or even all of the data for all samples. Unfortunately, there is a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.
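As a rough sketch of what grouping or partitioning at different levels means, the snippet below groups per-lane records by lane, by sample, or as one cohort-wide batch. The record layout, the `partition` helper, and all identifiers are assumptions made for illustration; they do not correspond to any GATK command or option, and library names are assumed to be unique across samples.

```python
from collections import defaultdict

# Hypothetical per-lane records: (sample, library, lane identifier).
records = [
    ("sampleA", "libA1", "FC1.1"),
    ("sampleA", "libA1", "FC1.2"),
    ("sampleA", "libA2", "FC1.3"),
    ("sampleB", "libB1", "FC1.4"),
]


def partition(records, level):
    """Group lane records by 'lane', 'library', 'sample' or 'cohort'."""
    index = {"sample": 0, "library": 1, "lane": 2}
    groups = defaultdict(list)
    for rec in records:
        # At the cohort level everything lands in a single group.
        key = "all" if level == "cohort" else rec[index[level]]
        groups[key].append(rec)
    return dict(groups)


by_lane = partition(records, "lane")      # many small, cheap jobs
by_sample = partition(records, "sample")  # fewer, larger jobs
by_cohort = partition(records, "cohort")  # one job that sees all the data
```

Coarser grouping yields fewer but larger units of work, which is the computational trade-off described above.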
Updated on 2017-12-24
From rxy712 on 2015-11-04
Thank you for the information, which is helpful! I am wondering where to find the library information to put into the read group. I still do not quite understand what a library is. I have a flowcell ID or sample ID, but I am not sure whether either is the right one. I know that library information is used by MarkDuplicates, but without that information, would the result change a lot?
Thank you very much!
From Geraldine_VdAuwera on 2015-11-05
It depends on your experimental design. Do you know how many libraries were prepared for each sample, and how they were arranged on the flowcells?
From rxy712 on 2015-11-05
Let’s say I have paired tumor/normal samples for 10 patients, so there are 20 samples in total. I think in this case the sample ID is the same as the library ID, and I have 20 libraries, right? Thank you!
From rxy712 on 2015-11-05
added: each sample is sequenced in the same flowcell but on different lanes.
From Sheila on 2015-11-24
@rxy712
Hi,
You can put all the information into proper read groups with the help of this article: http://gatkforums.broadinstitute.org/discussion/6472/read-groups
-Sheila