created by shlee
on 2016-01-08
This tutorial updates Tutorial#2799.
Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands for you to follow along (section 1) and include tips on estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, in section 4, we get into details of interest to some readers, including comments on the metrics file.
Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.
The Best Practices so far recommend MarkDuplicates. However, as always, consider your research goals.
If your research uses paired end reads and pre-processing that generates missing mates, for example by application of an intervals list or by removal of reference contigs after the initial alignment, and you wish to flag duplicates for these remaining singletons, then MarkDuplicatesWithMateCigar will flag them for you at the insert level using the mate cigar (MC) tag. MarkDuplicates excludes these singletons from consideration.
If the criteria by which the representative insert in a duplicate set is selected are important to your analyses, then note that MarkDuplicatesWithMateCigar is limited to prioritizing by the total mapped length of a pair, while MarkDuplicates can use this OR the default sum of base qualities of a pair.
If you are still unsure which tool is appropriate, then consider maximizing comparability to previous analyses. The Broad Genomics Platform has used only MarkDuplicates in their production pipelines. MarkDuplicatesWithMateCigar is a newer tool that has yet to gain traction.
This tutorial compares the two tools to dispel the circulating notion that the outcomes from the two tools are equivalent and to provide details helpful to researchers in optimizing their analyses.
We welcome feedback. Share your suggestions in the Comment section at the bottom of this page.
**Revision as of 5/17/2016:**
I wrote this tutorial at a time when the input could only be an indexed and coordinate-sorted BAM. Recently, the tools added a feature to accept queryname-sorted inputs, which in turn activates additional features. The additional features that a queryname-sorted BAM activates will give DIFFERENT duplicate flagging results. So for the tutorial's observations to apply, use coordinate-sorted data.

Download example data

The example data, snippet.bam, has (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (the mates carry the 8 flag), zero MAPQ (column 5), and asterisks for CIGAR (column 6). This notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals, such as in the generation of the snippet.
The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.
For a given file with all MC (mate CIGAR) tags accounted for, such as the snippet example data, use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.
Score duplicate sets based on the sum of base qualities using MarkDuplicates:
java -Xmx32G -jar picard.jar MarkDuplicates \
    INPUT=6747_snippet.bam \ #specify multiple times to merge
    OUTPUT=6747_snippet_markduplicates.bam \
    METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
    CREATE_INDEX=true \ #optional
    TMP_DIR=/tmp
Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:
java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
    INPUT=6747_snippet.bam \ #specify multiple times to merge
    OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
    METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
    CREATE_INDEX=true \ #optional
    TMP_DIR=/tmp
**Revision as of 5/17/2016:**
The example input 6747_snippet.bam is coordinate-sorted and indexed. Recently, the tools added a feature to accept queryname-sorted inputs that in turn by default activates additional features that will give DIFFERENT duplicate flagging results than outlined in this tutorial. Namely, if you provide MarkDuplicates a queryname-sorted BAM, then if a primary alignment is marked as duplicate, the tool will also flag its (i) unmapped mate, (ii) secondary and/or (iii) supplementary alignment record(s) as duplicate.

Each tool ranks the inserts within a duplicate set according to DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar this is TOTAL_MAPPED_REFERENCE_LENGTH, and it is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
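For example, to have MarkDuplicates rank duplicate sets the same way MarkDuplicatesWithMateCigar does, you could switch the strategy explicitly. This is a minimal sketch; the output file names are placeholders.

```
java -Xmx32G -jar picard.jar MarkDuplicates \
    INPUT=6747_snippet.bam \
    OUTPUT=6747_snippet_markduplicates_tmrl.bam \
    METRICS_FILE=6747_snippet_markduplicates_tmrl_metrics.txt \
    DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH \
    TMP_DIR=/tmp
```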
To mark duplicates across multiple files from the same library, specify an INPUT parameter for each file. The tools merge the read records from the multiple files into the single output file. The tools mark duplicates for the entire library (RGLB) and account for optical duplicates per RGID. INPUT files must be coordinate-sorted and indexed.
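For example, to merge and mark two read group BAMs from the same library in one go, a command along these lines should work; the lane-level file names here are hypothetical.

```
java -Xmx32G -jar picard.jar MarkDuplicates \
    INPUT=6747_snippet_lane1.bam \
    INPUT=6747_snippet_lane2.bam \
    OUTPUT=6747_snippet_merged_markduplicates.bam \
    METRICS_FILE=6747_snippet_merged_markduplicates_metrics.txt \
    TMP_DIR=/tmp
```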
We set OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count of optical duplicates and the library complexity estimate in the metrics file, in that whatever is counted as an optical duplicate does not factor towards library complexity. The increase has to do with the fact that our example data was sequenced in a patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold, so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why we still count optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
To physically remove duplicate records from the output, set the REMOVE_DUPLICATES parameter to true. However, given that you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, a better option for value-added storage efficiency is to retain the resulting marked file over the input file.

To generate a matching .bai index, add and set the CREATE_INDEX parameter to true.
For the snippet example data, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.
The seemingly simple task of marking duplicates is one of the most memory-hungry processes, especially for paired end reads. Both tools are compute-intensive and require more memory than other processes.
MarkDuplicatesWithMateCigar is a single-pass tool: it streams the duplicate marking routine in a manner that allows for piping, but for a given file its memory requirements can be greater than those of MarkDuplicates. Due to these memory constraints, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.
For large files, (1) use the Java -Xmx setting and (2) set the TMP_DIR parameter to a temporary directory. These options allow the tool to run without slowing down and without causing an out-of-memory error. For the purposes of this tutorial, commands are given as if the example data were a large file, which we know it is not.
java -Xmx32G -jar picard.jar MarkDuplicates \
    ... \
    TMP_DIR=/tmp
These options can be omitted for small files such as the example data and the equivalent command is as follows.
java -jar picard.jar MarkDuplicates ...
The high memory cost, especially for MarkDuplicatesWithMateCigar, arises in part because the tool systematically traverses genomic coordinate intervals for the inserts in question. For every read it marks as a duplicate, it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs with each record emitted in its coordinate turn. In the meantime, this information is held in memory, the first choice for faster processing, until the memory limit is reached, at which point it spills to disk. We set this limit high to minimize instances of memory spilling to disk.
In the example command, the -Xmx32G
Java option caps the maximum heap size, or memory usage, to 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version
, and look for MaxHeapSize
.
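On a Unix-like system, one way to pull out just that line is shown below; this is a convenience sketch, not part of the tools themselves.

```
java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
```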
When the tool hits the memory limit, memory spills to disk, forcing data to move in and out of disk I/O and slowing the process down. The disk location is what you specify with the TMP_DIR parameter. If you work on a server separate from where you read and write files, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk. This is because the tool then additionally avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space. Use df -h
to see how much is available.
The aim of duplicate marking is to flag all but one of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.
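As an aside, if you want to sanity-check the flagging yourself, samtools can count records by the 1024 flag; this is an illustrative sketch using the tutorial's marked output name.

```
# count records flagged as duplicate (0x400 set)
samtools view -c -f 1024 6747_snippet_markduplicates.bam

# count records not flagged as duplicate (0x400 unset)
samtools view -c -F 1024 6747_snippet_markduplicates.bam
```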
For a whole genome DNA sample, duplicates arise from three sources: (i) in DNA shearing from distinct molecular templates identical in insert mapping, (ii) from PCR amplification of a template (PCR duplicates), and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor differently.
When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.
GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead
so you can include duplicates in analyses.
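As an illustration only, a GATK3-style HaplotypeCaller command with the duplicate read filter disabled might look like the sketch below; the reference and output names are placeholders.

```
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I 6747_snippet_markduplicates.bam \
    -drf DuplicateRead \
    -o output_with_duplicates.vcf
```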
The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate lane-level library complexity, and second to aggregate data per sample while marking all library duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.
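A minimal sketch of that two-pass approach, with hypothetical per-lane file names, could look like this.

```
# Pass 1: mark duplicates per lane to obtain lane-level metrics (repeat for each lane)
java -jar picard.jar MarkDuplicates \
    INPUT=sampleA_lane1.bam \
    OUTPUT=sampleA_lane1_md.bam \
    METRICS_FILE=sampleA_lane1_metrics.txt

# Pass 2: aggregate the sample; prior duplicate flags are reassessed and overwritten
java -jar picard.jar MarkDuplicates \
    INPUT=sampleA_lane1_md.bam \
    INPUT=sampleA_lane2_md.bam \
    OUTPUT=sampleA_aggregated_md.bam \
    METRICS_FILE=sampleA_aggregated_metrics.txt
```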
Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags, and it is poorly suited to alignments with large reference skips, e.g. an N in the CIGAR string of spliced RNA alignments. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
To reach a high target coverage depth, some fraction of sequenced reads will by stochastic means be duplicate reads.
Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.
First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.
Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only and therefore the same number of duplicates. To reiterate, the number of identical loci or duplicate sets and the records within each set are the same for each tool. However, each tool differs in how it decides which insert(s) within a set get flagged, and thus which insert remains the representative non-duplicate. Also, if there are ties, the tools may break them differently, in that tie-breaking can depend on the sort order of the records in memory.
We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.
See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered for the insert. For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude these.
The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.
The histogram values extrapolate the calculated library complexity to a saturation curve plotting the gains in complexity if you sequence additional aliquots of the same library. The first bin's value represents the current complexity.
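If you want to plot that saturation curve yourself, the histogram can be extracted from the metrics file. This assumes the standard Picard metrics layout, in which the histogram follows a line beginning with `## HISTOGRAM`.

```
sed -n '/^## HISTOGRAM/,$p' 6747_snippet_markduplicates_metrics.txt
```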
Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.
What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.
Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Let's say for InsertAB, ClusterA and ClusterB, and for InsertCD, ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates, and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse directions to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientation of these two pairs is reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation.
Updated on 2016-05-25
From shlee on 2016-02-08
Hi @Longinotto,
The aim of the lane-level marking is different than the library-level duplicate marking. The lane-level marking is done earlier in the pipeline, before aggregation, to control for lane-specific variation. The Broad’s Genomics Platform does this step if it’s possible and I’m told they may consider removing this step in the future. I gather any sequencing facility would have an interest in lane-level duplication rates, for noting duplicates that arise during sequencing, for optimization purposes. For those of us on the data receiving end, we may or may not be interested in lane-level information depending on the quality assured by our sequencing provider.
The library-level duplicate marking accomplishes two things. One, it merges all the BAMs. Two, it marks all the duplicates that we want to discount from our analyses and hide from IGV view.
We use the library complexity estimate + the histogram values (both provided in the metrics file) in assessing the quality of our libraries IN CONJUNCTION with coverage information. For example, the Genomics Platform uses the coverage information to assess whether they have reached promised coverage (80%) at the promised depths (30x). If the coverage target is not met, then the histogram values (derived from the complexity estimate) tell us how much more of the library should be sequenced to reach the target.
I’m glad you find the article useful. And I hope I have answered your questions. Let me know if otherwise.
-Soo Hee
From Nebetbastet on 2016-05-17
Thank you for this tutorial. Even though it is very clear, I ran into some problems. Could you help me?
I tried to use MarkDuplicates, following your tutorial, to estimate the number of duplicates due to the sequencer. I worked with an RNA-seq 50 bp single-end dataset, sequenced on the HiSeq 4000. This dataset has lots of duplicates.
Here are the parameters used:
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000
SORTING_COLLECTION_SIZE_RATIO=0.25
REMOVE_SEQUENCING_DUPLICATES=false
TAGGING_POLICY=DontTag
REMOVE_DUPLICATES=false
ASSUME_SORTED=false
DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES
PROGRAM_RECORD_ID=MarkDuplicates
PROGRAM_GROUP_NAME=MarkDuplicates
READ_NAME_REGEX=“optimized capture of last three ‘:’ separated fields as numeric values” (default)
VERBOSITY=INFO
QUIET=false
VALIDATION_STRINGENCY=STRICT
COMPRESSION_LEVEL=5
MAX_RECORDS_IN_RAM=500000
CREATE_INDEX=false
CREATE_MD5_FILE=false
GA4GH_CLIENT_SECRETS=client_secrets.json
It always tells me “Found 0 optical duplicate clusters.” However, I could check manually that I have some sequencer duplicates. For example, I have two reads which are:
K00201:25:H7HKGBBXX:2:2124:14387:26353 256 chr1 267802 3 24M1I25M * 0 0 CTTGGATGTTCGGGAAAGGGGGTTCTTTATCTAGGATCCTTGAAGCACCCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ CC:Z:chr17 MD:Z:24A24 PG:Z:MarkDuplicates XG:i:1 NH:i:2 HI:i:0 NM:i:2 XM:i:1 XN:i:0 XO:i:1 CP:i:83229346 AS:i:14 XS:A: YT:Z:UU
K00201:25:H7HKGBBXX:2:2124:14976:27232 256 chr1 267802 3 24M1I25M * 0 0 CTTGGATGTTCGGGAAAGGGGGTTCTTTATCTAGGATCCTTGAAGCACCCAAFFFJJJFJJJJJJJJJJJJJ14 XS:A: YT:Z:UU
Both are on the same tile (2124) and their coordinates are x=14387 and y=26353 for the first one, and x=14976 and y=27232 for the second one. They are duplicates and they are 1058 px apart (<2500), so why are they not marked as optical duplicates? Where did I make a mistake?
Thank you in advance for your help :smile:
From yfarjoun on 2016-05-17
These are secondary alignments. Can you show the primary alignments instead?
From shlee on 2016-05-17
Hi @Nebetbastet,
The tutorial as presented uses a coordinate sorted input BAM. MarkDuplicates, given this coordinate sorted input, will ignore supplementary/secondary alignments. That is, it does not mark them. Given the alignment records you’re showing us are marked with the `256` SAM flag as @yfarjoun points out, indicating they are secondary alignments, they are not considered by the tool. That you’re getting a `Found 0 optical duplicate clusters` count may at first be puzzling but can be rationalized. Do the primary alignments of these reads align to the same locus? If they do, and have matching start sites, are they then marked as duplicate? Remember that multimapping reads can be distributed by the aligner to different loci.
On a side note, if you care about marking duplicates for secondary alignments, Picard recently added a new feature to [MarkDuplicates](http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates) that allows for this. I’ll update the tutorial details to reflect this. Basically, if you provide MarkDuplicates a queryname sorted BAM (which is something [SortSam](http://broadinstitute.github.io/picard/command-line-overview.html#SortSam) can do for you), then if a primary alignment is marked as duplicate (whether optical or other type of duplicate), then its (i) unmapped mate, (ii) secondary and/or (iii) supplementary alignment record will also get flagged as duplicate.
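For reference, queryname sorting with SortSam would look something like this sketch; file names are placeholders.

```
java -jar picard.jar SortSam \
    INPUT=input.bam \
    OUTPUT=input_querysorted.bam \
    SORT_ORDER=queryname
```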
I hope this helps. Let us know what you find.
From Nebetbastet on 2016-05-18
Thank you very much for your answer and for the information you gave me about secondary alignments.
Actually, I think I understood what my problem was. I used single-end data (most of the projects in my team are single-end) and I just noticed Markduplicates needs paired-end data. I read the documentation too quickly and I was simply supposing Markduplicates could detect optical duplicates using both single-end and paired-end data. So my problem was quite trivial…
I just used it in paired-end data and I could detect “optical” duplicates :)
From shlee on 2016-05-18
@Nebetbastet
I question your conclusion that the single-ended nature of your data is the cause of any discrepancies. As far as I know, MarkDuplicates will take either single-end or paired-end data and should flag both optical and molecular duplicates for either. You yourself have said that you get many duplicates marked with your single-end RNA-Seq data.
There are implications for duplicate flagging for these types of data that you should keep in mind and I’m going to mention them here since you bring it up.
For paired-end data, the insert is considered for duplicate marking (both first and second reads are marked duplicate if the insert is considered duplicate). For single-end data, the reads alone are considered for duplicate marking since the data do not provide insert information. So for single-end data you’ll end up with more reads flagged as duplicate than if the same inserts had been paired. This artificially inflates the number of duplicates as defined in the conventional sense, i.e. insert duplicates. For example, if you had two inserts sized 200 bp and 300 bp and the reads from each mapped identically, then one of these reads would be marked as duplicate for the single-end data. If on the other hand data is paired, then neither insert would be marked as duplicate because it is obvious that the inserts are different and thus cannot be duplicates. I think this will impact single-end data’s estimated library complexity metric by artificially lowering it unless the tool has a different formula in calculating this to accommodate single-ended data. I’d have to check on this last bit.
From Nebetbastet on 2016-05-19
Actually, you are right. I just realized that after I wrote it. Indeed, with my single-end data, I did find duplicates. But MarkDuplicates never found any optical duplicates! I tried with many samples in many datasets (and by testing several values of the optical duplicate pixel distance). Honestly, I cannot believe there are 0 optical duplicates in all these samples.
In addition, with the only two paired-end samples I tested, I found 12% and 6% of optical duplicates.
I wonder if there is not a problem with the software. I think I will contact Picard to ask them.
Thank you for your information about single-end reads, it’s good to have that in mind.
From yfarjoun on 2016-05-19
Actually, MarkDuplicates only gathers information for optical duplicates on PAIRS. Note that the metric is called `READ_PAIR_OPTICAL_DUPLICATES` and the comments specifically state that this is only for pairs. While this could be changed, it would involve changing the metrics, and a strong case for that would need to be made.
From shlee on 2016-05-19
Thanks @yfarjoun. That is good to know. Just to clarify further,
@Nebetbastet, just because optical duplicates are not counted does not mean they are not flagged. MarkDuplicates flags all duplicates that fit the criteria and then takes this pool for optical duplicate consideration. So we've learned that for single-end data, we do not segment out optical duplicates from the pool of marked duplicates for counting. For single-end data, the supposition is that we cannot apply the pair-orientation criterion that we apply to paired-end optical duplicates, and are thus doubly uncertain (uncertain of common insert, uncertain of common pair orientation), despite proximity, that a duplicate may be optical. So although they are marked as duplicate alongside other types of duplicates, they are not counted in the stdout nor in the metrics file.
I’d like to add that in our typical conservative approach to variant discovery, we don’t distinguish the duplicate types and discount all duplicates. I’m curious, @Nebetbastet, how are the optical duplicates important to your research?
From Nebetbastet on 2016-05-20
Thank you @yfarjoun and
@shlee. I did not know that MarkDuplicates does not count optical duplicates for single-end data on purpose. I thought there was a problem.
I needed it because my team has acquired a new sequencer (HiSeq 4000). Previously, we worked with the HiSeq 2500. With paired-end data, I could find that the proportion of “optical” duplicates is 10-500 times higher with the HiSeq 4000. It's information we wanted to know.
In addition, as I analyze several NGS projects for researchers, I need to control data quality. I am interested in routinely using a tool to control the % of duplicates due to sequencing. I think this information can be interesting if, for example, I meet a problem with the data. It can help to figure out what happened. Even if the estimation is uncertain, I think it could be useful to compare between projects/datasets.
In any case, thank you very much for your insightful answers.
From shlee on 2016-05-20
I’m glad we could be of help. Perhaps @yfarjoun has comments on your high duplicate rates.
From cooperjam on 2016-06-14
Hello, I have noticed that when I use Mark Duplicates with single end reads that I do not get the ROI table that I get when using paired end data. Is this a limitation of SE reads or a bug? The metrics file contains the standard line about # redundant reads, % dup etc but the histogram data isn’t present.
I found this thread on biostars suggesting that it might have to do with the read group. https://www.biostars.org/p/115044/
However I have tried with and without the read group info added and still no histogram.
From lhogstrom on 2016-06-14
@cooperjam, the return-on-investment projection is based on the READ_PAIRS_EXAMINED and READ_PAIR_DUPLICATES entries reported in the metrics file. The tool does not currently support library complexity predictions based on single end reads.
From timk on 2016-07-05
… solved :)
From neva on 2016-07-14
Hi Soo Hee,
What kind of error are you trying to capture with a larger distance, e.g. ExAmp cluster errors? Do you know of any way to definitively verify (via the sequence string, for example, instead of mapping position) that a duplicate is a sequencer duplicate and not a PCR duplicate? And one final question: have you considered counting the optical/sequencer duplicates across tiles? Couldn't a read be a sequencer duplicate of one located just across the tile boundary on an adjacent tile?
Thanks for any insight!
From yfarjoun on 2016-07-15
The larger distance is for two things: 1. The pixel definitions on the HiSeq X (and possibly the 4000) have changed, requiring us to increase the distance. 2. An effect dubbed “pad-hopping” that happens during ExAmp seems to cause templates to appear in nearby wells. The 2500 distance seems to capture most of the resulting effect.
I am not aware of a way to “definitively” distinguish between the two classes, but our analysis shows that this is good enough for the purpose of library-size estimation (especially given that the model for library-size estimation is approximate as well).
We have considered looking at neighboring tiles. The problem is that each sequencer might have a different geometry, not only regarding what constitutes a neighboring tile, but also the size of each tile, and both of these are needed in order to do this properly. That added complication, together with the fact that it doesn't seem to affect the result (estimated library size) and the fact that (as I alluded to before) the model for library-size estimation is approximate, led us to conclude that it isn't worth the effort.
I hope this helps.
From xhe764 on 2016-08-04
Hi Soo Hee
Thank you for this tutorial and your answers to various questions. I'm having trouble running MarkDuplicates on my RNA-seq data. The problem is that MarkDuplicates identifies some extremely large duplicate sets, ranging from 3 million to more than 7 million reads, and the output shows the program stuck running “OpticalDuplicateFinder compared” 1000 reads at a time, not finishing after 48 hours. Do you have any suggestions for dealing with this problem? Any comments will be highly appreciated. Xiao
From shlee on 2016-08-08
Hi Xiao (@xhe764),
You can skip optical duplicate finding by setting `READ_NAME_REGEX`=null. Let me know if this does not work for you.
From yfarjoun on 2016-08-08
should be `READ_NAME_REGEX=null`
From yfarjoun on 2016-08-08
Please note that without optical duplicate finding all the duplicates will be assumed to be non-optical, which, in turn, will cause the library size estimation to be less accurate.
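Putting the two notes above together, a minimal sketch of a command that skips optical duplicate finding (file names hypothetical):

```
java -jar picard.jar MarkDuplicates \
    INPUT=rnaseq.bam \
    OUTPUT=rnaseq_markduplicates.bam \
    METRICS_FILE=rnaseq_markduplicates_metrics.txt \
    READ_NAME_REGEX=null
```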
From jinkeanlim on 2016-09-23
I have encountered with an error (below) when using the MarkDuplicates. No results were produced.
/appl/bio/picard/picard-2.6.0/bin/picard: line 5: 27651 Killed /usr/lib/jvm/java-1.8.0-oracle.x86_64/bin/java -Xmx32g -jar /appl/bio/picard/picard-2.6.0/picard/build/libs/picard.jar $@
Could some please help?
From shlee on 2016-09-30
Hi @jinkeanlim,
Can you please post the command that produced the error? Thanks.
From jinkeanlim on 2016-10-03
> @shlee said:
> Hi @jinkeanlim,
>
> Can you please post the command that produced the error? Thanks.
Dear @shlee
The problem was solved. It was due to server restriction instead of potential bug of Picard tools.
Thanks for your time.
From mariel on 2016-10-06
Hello,
I am working with whole genome resequencing data for population genomic analyses.
I am marking and removing duplicates with MarkDuplicates.
I was wondering whether I should remove optical duplicates or keep them using READ_NAME_REGEX. Also if I am removing them, should I change the optical duplicate pixel distance as data were generated with the HiSeq X-ten?
Thank you,
Marie
From shlee on 2016-10-06
Hi @mariel, MarkDuplicates marks all duplicates whether optical or not.
From Sumudu on 2016-10-08
Hi,
I get an out-of-memory error when I try to run MarkDuplicates on my machine. I'm working on a stand-alone computer with RAM ~3GB and disk space ~900GB. Platform: x86_64-pc-linux-gnu (64-bit), running under Ubuntu 16.04 LTS. My initial R1 and R2 FASTQ data sets are each about 1.4GB in size. I mapped to the reference with BWA-MEM.
When I check with "java -XX:+PrintFlagsFinal -version", MaxHeapSize shows ~1 015 021 568. Does this mean that I can use -Xmx up to 1015G?? I used -Xmx35G previously when I ran MarkDuplicates.
And kindly explain how to set TMP_DIR in my case, since my machine is not connected to any server. Is it the disk with 900GB?
Appreciate if you can give some advice.
Thank you.
Sumudu
From Geraldine_VdAuwera on 2016-10-08
Hi there, the -Xmx setting applies to the amount of RAM memory, so you're allocating far more than what your machine can provide. Simply put, you will not be able to run full-scale work on your laptop; you need a server or a cloud platform to do this work.
From mariel on 2016-10-08
Hi @shlee,
Thanks for your answer. When I run MarkDuplicates and merge several bam files (previously sorted), the output bam file should be sorted too, right?
Thanks,
Marie
From Sheila on 2016-10-12
@mariel
Hi Marie,
Yes, the output BAM file should be sorted.
-Sheila
From fabiodp on 2017-03-02
Hi everyone,
Do you have any information about caveats or critical parameters to set in bowtie2 in order to properly find duplicates with MarkDuplicatesWithMateCigar?
Hope this is the right forum… :smile:
Thank you,
Fabio
From shlee on 2017-03-02
Sorry @fabiodp, bowtie2 is not a tool that we use in our workflows so we have no comments for you.
From shlee on 2017-03-06
I feel bad about my curt reply @fabiodp. It’s a result of my efforts to stay within scope as I have a tendency to go down rabbit holes. Let me just follow-up and say that you should not use MarkDuplicatesWithMateCigar with RNA-Seq data because the memory requirements are restrictive. Rather, we have a new tool, [UmiAwareMarkDuplicatesWithMateCigar](https://broadinstitute.github.io/picard/command-line-overview.html#UmiAwareMarkDuplicatesWithMateCigar) that should be better for RNA-Seq data. However, we have yet to suss out our recommendations for the tool parameters. I’m sorry I cannot be more helpful.
From fabiodp on 2017-03-08
Do not worry @shlee, no problem at all, but I really appreciate your kindness! Actually mine are ChIP-seq data, not RNA-seq data; I will try the tool you suggested. Just to update: for some reason that I still don't understand, MarkDuplicates worked out the duplication level in both my bowtie and bowtie2 alignments whereas MarkDuplicatesWithMateCigar did not. I guess it could be a problem with the CIGAR in bowtie SAM outputs…
From beadoro on 2017-03-09
Hello,
I have a puzzling problem with formatting the command line for MarkDuplicates such that multiple input bam files get merged. I want to execute MarkDuplicates inside a loop over samples and the number of bam files per sample is not constant. So, for each sample I construct a string $INFILES that I place in the MarkDuplicates command line, for example:
```
echo $INFILES
INPUT=SK2665_2_reordered.bam INPUT=SK2665_1_reordered.bam
```
This follows the instructions above, but generates the following error:
```
[Thu Mar 09 04:21:04 CET 2017] picard.sam.markduplicates.MarkDuplicates INPUT=[SK2665_2_reordered.bam INPUT=SK2665_1_reordered.bam]
…
Cannot read non-existent file: /path/to/scratch/SK2665_2_reordered.bam INPUT=SK2665_1_reordered.bam
```
The files are definitely there (I checked). I've tried various permutations (using ‘INPUT=’ only once, followed by a space/comma/comma-space separated list of files): same result. So, to move forward somehow I decided to use MergeSamFiles first and then input the merged file into MarkDuplicates. MergeSamFiles: same $INFILES spec, same error.
But when I place the command in the script without string variable and exactly as specified in the documentation:
```
java -jar picard.jar MergeSamFiles \
    I=SK2665_1_reordered.bam \
    I=SK2665_2_reordered.bam \
    O=SK2665_merged.bam
```
it works. I am new to shell scripting and at a loss to understand what is going on here. It would be really useful if there was a way to construct MarkDuplicates commands inside the loop such that the number of input files per sample could vary.
From yfarjoun on 2017-03-09
can you do this
```
java -jar picard.jar MarkDuplicates $(echo $INFILES) OUTPUT=out….
```
From Sheila on 2017-03-09
@fabiodp
Hi Fabio,
We don’t have any experience with ChIP-seq data, but there are some other users who have been working with ChIP-seq data and GATK. You may find [this thread/user](http://gatkforums.broadinstitute.org/gatk/discussion/8990/remove-duplicate-reads-from-chip-seq-data) helpful.
-Sheila
From beadoro on 2017-03-09
@yfarjoun
I just tried this
```
INFILES="INPUT=SK2665_2_reordered.bam INPUT=SK2665_1_reordered.bam"
java -jar picard.jar MarkDuplicates \
    $(echo $INFILES) \
    OUTPUT=dedupped.bam
```
and it worked. Marvellous. Many thanks for your advice.
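For readers hitting the same issue, one way to build a variable-length INPUT list inside a loop is sketched below; it relies on shell word splitting, so it assumes file names without spaces, and the output names are placeholders.

```
INFILES=""
for f in SK2665_*_reordered.bam; do
    INFILES="$INFILES INPUT=$f"
done

# left unquoted on purpose so each INPUT=... becomes a separate argument
java -jar picard.jar MarkDuplicates $INFILES \
    OUTPUT=SK2665_markduplicates.bam \
    METRICS_FILE=SK2665_markduplicates_metrics.txt
```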
From prasundutta87 on 2017-03-20
Hi,
How important is it to remove duplicates when looking for allele-specific expression from RNA-seq data? Different papers say different things about this step. In my opinion, not removing duplicates is the way to go because highly expressed genes will get saturated with reads. Will removing duplicates actually affect variant calling for ASE analysis? And if we do use MarkDuplicates, is it OK to use REMOVE_SEQUENCING_DUPLICATES=true, for only removing PCR duplicates rather than removing all kinds of duplicates?
From shlee on 2017-03-20
Hi @prasundutta87 ,
For RNA-Seq data, yes, we recommend disabling downstream tools’ duplicate read filter with ` -drf DuplicateRead` (drf=disable read filter). We do NOT recommend actually removing duplicate reads from your data, as this is a loss of information.
From prasundutta87 on 2017-03-20
So technically, I could go without using MarkDuplicates, right? Or should just marking duplicates (and not removing any kind of duplicates) be the process? (The Best Practices for RNA-seq say so.) But for ASE, can this step be completely avoided?
From shlee on 2017-03-20
@prasundutta87,
Allele-specific expression (ASE) is not a GATK workflow. It’s up to you to determine how best to use our tools for your aims.
Yes, for any process that requires comparing counts of reads as representing RNA expression, removing any reads from the analysis would be detrimental and counter to the aim.
I’m not sure if this is still the case—@Geraldine_VdAuwera will know for sure—certain GATK tools reject alignment files for which duplicates are unmarked.
From Geraldine_VdAuwera on 2017-03-20
Right, for ASE people typically don’t discount duplicates — either by not marking them in the first place or by setting the tool to ignore the marking. See [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4574606/) for more discussion.
From shlee on 2017-03-21
Thanks for this 2015 paper @Geraldine_VdAuwera. I’m going to have to read it. Skimming, looks like the relevant passage is:
> However, we observe consistent albeit infrequent signs of PCR artifacts in the Geuvadis AE data, especially affecting lowly covered sites — where duplicates are mostly true PCR duplicates, since saturation is unlikely. Removing duplicate reads reduces technical sources of AE at these sites, while having a minimal effect on highly covered, read-saturated SNPs (Figure S4e in Additional file 4). Thus, we suggest that removing duplicate reads is a good default approach for AE analysis, and it is implemented as a default in the GATK tool. However, it is important that the retained read is either chosen randomly or by base quality, and not by mapping score, so as not to bias towards the reference allele.
My thoughts — This approach is definitely relevant to samples with lower library complexity and less relevant for samples with high library complexity as would be obtained from PCR-free preps.
From prasundutta87 on 2017-03-28
Thank you for your thoughts, I had gone through this paper as well and stumbled about this para as well. I may need to check the relevancy of this approach based on the library prep. Thank you.
From prasundutta87 on 2017-03-28
And is there a way that what ever update I get in any gatk community (like this), I get an update in an email (I log in through my gmail account) ? I just revisited this discussion and found a new reply which I noticed after 8 days.
From Geraldine_VdAuwera on 2017-03-28
In the FAQs there is an article with instructions for setting notifications.
From prasundutta87 on 2017-03-28
Thanks..I changed my notification preferences..
From qtian on 2017-04-08
Hello Dear,
This tutorial is of great help to me. I have a question about this part: if I mark duplicates following the instructions for the MarkDuplicates tool, and set the REMOVE_DUPLICATES parameter to false, then the OUTPUT=6747_snippet_markduplicates.bam file keeps the duplicates. Can I remove the marked duplicates later directly, without re-marking duplicates on 6747_snippet_markduplicates.bam? And how? Might
“java -Xmx32G -jar picard.jar MarkDuplicates \
INPUT=6747_snippet_markduplicateswithmatecigar.bam \
OUTPUT=6747_snippet_removed_duplicateswithmatecigar.bam \
REMOVE_DUPLICATES=true“
work?
From yfarjoun on 2017-04-08
Your method would work, but would be computationally expensive, since it would be re-marking-duplicates.
If you have marked duplicates already, a faster approach would be to use
> samtools view -F 1024 ……
see the samtools man page (http://www.htslib.org/doc/samtools.html) and the explain sam flags page (https://broadinstitute.github.io/picard/explain-flags.html)
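For completeness, a spelled-out sketch of that samtools approach (output name is a placeholder):

```
# keep only records without the duplicate flag (0x400) set
samtools view -b -F 1024 6747_snippet_markduplicates.bam > 6747_snippet_duplicates_removed.bam
```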
From qtian on 2017-04-09
That’s helpful, thank you very much!
From prasundutta87 on 2017-11-08
Hi,
While removing duplicates, is it important to sort by query name rather than coordinate? Because reads having the same query name arose from the same template DNA (correct me if I am wrong) and may be duplicates.
From yfarjoun on 2017-11-08
@prasundutta87, no. you can have the file in either queryname or coordinate order. The results (for primary, non-supplementary reads) will be identical in these cases.
if you provide a queryname sorted input file, the result will also have the secondary and supplementary reads marked together with their “canonical” reads.
From prasundutta87 on 2017-11-08
Thank you @yfarjoun . So when you say canonical reads, does that mean the original read from where the secondary and supplementary reads originated? Markduplicates will mark them as duplicates as they will have the same queryname and that is wrong..
From yfarjoun on 2017-11-08
no. that’s not what I mean….consider the following scenario:
READ1/1 primary, non-supplementary
READ1/2 primary, non-supplementary
READ1/2 supplementary
READ1/2 secondary
READ2/1 primary, non-supplementary
READ2/2 primary, non-supplementary
READ2/2 supplementary
READ2/2 secondary
assume that READ1 and READ2 are duplicates of each other, and that READ2 “should” be marked as a duplicate.
The difference between running MD in the two sort orders is that,
in coordinate sorted order, the secondary and the supplementary reads will not be marked as duplicates, while
in queryname order they will be marked as duplicates.
From prasundutta87 on 2017-11-08
Ok, makes sense. Thank you for the explanation. I just wanted to know how important duplicate marking of supplementary reads is from the point of view of variant calling. I am aware that in HaplotypeCaller, the NotPrimaryAlignmentFilter is switched on by default, which does NOT filter out read records that are supplementary alignments. Just to point out, I am working with a non-model organism.
Furthermore, I used -M with bwa mem (which flags all supplementary reads as secondary), as instructed by an old piece of GATK documentation. I am aware now that the current version of MarkDuplicates (2.14) can handle supplementary information as well. My understanding is that I lost the supplementary information, which from a biological perspective gives information about translocation events, and that further downstream I have compromised any possible structural variation studies with my data. Does it mean I have to perform realignment again without -M?
From shlee on 2017-11-08
Hi @prasundutta87,
To be clear, GATK handles supplementary alignments. The `NotPrimaryAlignmentFilter` only applies to secondary alignments. The Best Practice Workflow that uses `bwa mem -M` is a reference implementation that showcases how we process samples for a particular reference assembly, namely human GRCh37. I believe we still use this workflow for those who request such processing.
The parameters of this variant calling workflow are conservative in that we limit a read from counting towards variant calls to one mapping location, if any. The `mem` in bwa mem refers to maximal exact matches and indicates a mode of alignment that tries to minimize potentially spurious multiple alignments. Hopefully, the deep coverage afforded by short read sequencing tech allows inserts to distribute well across non-unique locations. You should note that if the aligner cannot determine which of multiple mapping locations is better, it will assign a MAPQ of zero to the alignment it uses, and our workflows ignore these MAPQ0 alignments as well.
The `-M` of `bwa mem -M`further flags the shorter split supplementary hits as secondary. In [this tutorial](https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently), I state:
> -M to flag shorter split hits as secondary.
> This is optional for Picard compatibility as MarkDuplicates can directly process BWA’s alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.
Depending on your organism, what is your expectation for non-unique sequences in the reference that your sample's insert size (for paired end sequencing, or read length for single end sequencing) would be unable to map unambiguously? How important are such sequences to your research? Depending on the answers to such questions, supplementary alignments may more or less impact your results.
If this is a concern for your research, I suggest you [revert](https://gatkforums.broadinstitute.org/gatk/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam), realign and preprocess your reads in a manner that allows you to preserve relevant information and see how your calls differ from the previous calls and from any truthset you may have. If you will want to use supplementary alignments much like in alt-aware alignment, then see [here](https://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38).
From yfarjoun on 2017-11-08
@shlee can you take it from here?
From prasundutta87 on 2017-11-09
@shlee, thank you for the detailed explanation and for sharing the links. As I am working with water buffalo, it still has a draft genome and many unplaced scaffolds, and there is also no truth set available for soft filtering; those are some details about my organism of interest.
Going ahead with the suggestion and checking the difference in the call sets between BAM files produced with and without -M, I am on my way to doing that experiment. But a basic flagstat determined that there are only 0.5% secondary/supplementary reads present in the BAM files.
From shlee on 2017-11-09
@prasundutta87, the question then to ask is how are these 0.5% supplementary alignments distributed across the genome? Are they distributed more or less randomly or concentrated to specific loci such that they constitute a major fraction of the coverage for these loci? Are you interested in calling variants in these loci?
Again, remember GATK ignores secondary alignments. Last I checked (for GATK3) we are unable to disable the `NotPrimaryAlignmentFilter` filter.
From prasundutta87 on 2017-11-10
Thank you for suggesting this @shlee . I will look in into my secondary/supplementary alignment data and make an informed choice further.
From shlee on 2017-11-11
@prasundutta87,
Great that you are investigating the details. If you feel that this becomes all too time consuming, remember that others in genomics have decided these supplementary/secondary alignments are not high priority for their research aims, i.e. they did not pass some cost-benefit analysis. These research aims, so far as I'm aware, have been to catalogue common human variation or find rare or somatic mutations. I don't mean to send you down some wild goose chase.
From shlee on 2018-03-09
I’d like to add one more comment @prasundutta87. It may be that the interesting results are to be had if you traverse down the less-traveled road.
From Lavanya on 2018-05-23
> @shlee said:
> Hi @mariel, MarkDuplicates marks all duplicates whether optical or not.
Hi, I have the same doubt as Mariel.
Could you please let me know whether I can skip using “READ_NAME_REGEX” in my command line, given the tool marks all the duplicates whether optical or not? Thanks.
From yfarjoun on 2018-05-23
If the only thing you want from MarkDuplicates is the duplicate marking, then you can skip READ_NAME_REGEX or even set it to null (READ_NAME_REGEX=null) on the command line and the reads will be marked.
If you also want some of the library metrics (like library size and the ROI estimates in the metrics file) you’ll need to identify opticals since they do not “count” as duplicates for the purposes of library size estimation. If your read names are standard illumina formatted (5 or 7 fields separated by colons with the last 3 numeric) then you should not change READ_NAME_REGEX as it defaults to an optimized parser for the read names. If your read names are differently formatted, you’ll need to supply a regular expression for capturing these 3 fields (tile, x-coordinate, y-coordinate).
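To illustrate the custom case only, a regular expression for seven-field colon-delimited read names such as K00201:25:H7HKGBBXX:2:2124:14387:26353 might capture the last three fields as in the sketch below; this is a hypothetical example that you should verify against your own read names.

```
java -jar picard.jar MarkDuplicates \
    INPUT=input.bam \
    OUTPUT=output.bam \
    METRICS_FILE=metrics.txt \
    READ_NAME_REGEX='[^:]+:[^:]+:[^:]+:[^:]+:([0-9]+):([0-9]+):([0-9]+)'
```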
From abaano on 2018-09-14
Is there an option in either tool to ignore reads that are already marked as duplicates? What I am observing is that queryname-sorted files processed with MarkDuplicates and sorted, then merged with MarkDuplicatesWithMateCigar, lose duplicates (same with MarkDuplicates post-merging). I suspect this is due to the differences in marking dups between the two tools (queryname vs coordinate sorting), but one would like to think that if a read is already marked as a dup, there could be an option to ignore these reads.
From shlee on 2018-09-17
Hi @abaano,
As far as I know, MarkDuplicates/WithMateCigar will ignore previous markings and mark duplicates afresh. There is a separate tool to unmark duplicates, [UnmarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_UnmarkDuplicates.php), however, that’s not what you are asking.
Why are you choosing to merge BAMs post-MarkDuplicates yet again with MarkDuplicates/WithMateCigar? You can actually simultaneously MarkDups and merge RG BAMs by providing the files altogether to MarkDuplicates by specifying each file with `I=` (picard) or `-I` (gatk). I’ve done this with coordinate-sorted alignments and I believe this should also work with queryname-sorted alignment files. Please let us know if this is not the case.
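For example, in GATK4 syntax this would look something like the following sketch, assuming two read group BAMs from the same sample.

```
gatk MarkDuplicates \
    -I sample_rg1.bam \
    -I sample_rg2.bam \
    -O sample_markduplicates.bam \
    -M sample_markduplicates_metrics.txt
```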
From abaano on 2018-09-21
Hi @shlee
Thank you for your reply. In my workflow, I receive reads from multiple lanes/runs, so I process each FASTQ pair by streaming queryname-sorted bwa output to a first pass of duplicate marking and then sorting. Then I merge the files, for which I intended to use MarkDuplicatesWithMateCigar as you suggested, using multiple INPUTs. However, looking at the duplicate count, reads that were marked as dups by samblaster are no longer marked by Picard, hence the reduced dup count. What I was hoping to achieve is to perform dedup after the merge to capture additional duplicates post-merge, and I hoped that the second dedup would be much faster since first-round dups could be ignored. It looks like my assumption was wrong. I will test again with removing dups from the file and see if that achieves the intended action; however, that will not be ideal for implementation. If only there were an option for Picard to ignore reads already marked as duplicate.
So from what I understand, you are suggesting to stick with one dedup. What mode is recommended or more reliable: queryname-sorted or coordinate-sorted dedup?
From shlee on 2018-09-21
Hi @abaano,
The results will be different when you MarkDuplicates on coordinate or queryname sorted alignments. The latter will additionally mark supplementary and secondary alignments. Which strategy you use is up to you. Will supplementary or secondary alignments be important to your analysis, e.g. as would be the case to call on alternate contigs in alt-aware GRCh38 alignments?
Duplicate marking by MarkDuplicates/WithMateCigar is at the insert level and therefore a more sophisticated strategy than marking at the single-read level. I am unfamiliar with samblaster but my understanding is that other duplicate marking tools only mark at the read level. The overall effect is that you should retain more (non-dup) reads with MarkDuplicates/WithMateCigar than read-level duplicate markers. The read-level strategy artificially marks more reads as duplicate when in fact they may be unique sources of information.
Each run through MarkDuplicates/WithMateCigar will mark duplicates afresh. The readgroup-level duplicate marking plus library-level duplicate marking (made possible by passing in all read-group BAMs for a sample, even if from multiple libraries) are each performed in certain production pipelines towards distinguishing lane-level artifacts versus library complexity. If you are not a sequencing facility, you should mostly only be interested in performing sample-level duplicate marking and aggregation.
If you want to perform read-group aggregation while retaining your original duplicate flags, there are numerous ways to just join the BAM files. If you want to comprehensively mark all types of duplicates, then use MarkDuplicates on all the readgroup BAMs for a sample jointly.
Please let me know if I can clarify any part of my answer.
From abaano on 2018-10-16
@shlee , thank you for the insightful answer. This is very useful. I do have a better understanding on how to utilize each tool.
From Praveen82 on 2018-11-06
Have a quick question. Do we need to use ‘fixmate’ before doing Picard Markduplicates?
From shlee on 2018-11-07
Hi @Praveen82,
Not necessarily. You can run your BAM through ValidateSamFile to see if there is a need. You may find [Article#7571](https://software.broadinstitute.org/gatk/documentation/article.php?id=7571) helpful in interpreting ValidateSamFile results.
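A minimal ValidateSamFile check might look like this sketch; the input name is a placeholder.

```
java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY
```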
Optical duplicate designation requires the same pair orientation.