How to Map and clean up short read sequence data efficiently

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by shlee

on 2015-11-23

If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow both maximize data quality, storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAM^XT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate sorted, indexed.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.

Jump to a section

Generate an unmapped BAM from FASTQ, aligned BAM or BCL

Mark adapter sequences using MarkIlluminaAdapters
1. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq B. Align reads and flag secondary hits using BWA-MEM C. Restore altered data and apply & adjust meta information using MergeBamAlignment D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

Installed Picard tools
Installed GATK tools
Installed BWA
Reference genome
Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
I divided the example data into two tarballs: tutorial6483piped.tar.gz contains the files for the piped process and tutorial6483intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Download example snippet data to follow along the tutorial.

Related resources

See this tutorial to add or replace read groups or coordinate-sort and index a BAM.
See this tutorial for basic instructions on using the Integrative Genomics Viewer (IGV).
For collecting alignment summary metrics, see CollectAlignmentSummaryMetrics, CollectWgsMetrics and CollectInsertSizeMetrics. See Picard for metrics definitions.
See SAM flags to interpret SAM flag values.
Tutorial#2799 gives an example command to mark duplicates.

Other notes

When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory.
- java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \ TMP_DIR=/path/shlee
- In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.
- java -jar /path/picard.jar MarkIlluminaAdapters
- To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.
When I call default options within a command, follow suit to ensure the same results.

1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

See this tutorial to add or replace read groups.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
To convert an Illumina Base Call files (BCL) use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAM^XT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

back to top

2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-though can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \ I=6483_snippet_revertsam.bam \ O=6483_snippet_markilluminaadapters.bam \ M=6483_snippet_markilluminaadapters_metrics.txt \ #naming required TMP_DIR=/path/shlee #optional to process large files

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences first set ADAPTERS to 'null' then specify each sequence with the parameter.

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6383_snippet_revertsam.bam, containing 275,546 reads.

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1)

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

After MarkIlluminaAdapters (step 2)

After SamToFastq (step 3)

After MergeBamAlignment (step 3)

back to top

3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need for the processor to read in and write out data for two of the processes, as piping retains data in the processor's input-output (I/O) device for the next process.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.

back to top

3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \ I=6483_snippet_markilluminaadapters.bam \ FASTQ=6483_snippet_samtofastq_interleaved.fq \ CLIPPING_ATTRIBUTE=XT \ CLIPPING_ACTION=2 \ INTERLEAVE=true \ NON_PF=true \ TMP_DIR=/path/shlee #optional to process large files

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. BWA aligner accepts interleaved FASTQ files given the -p option.
We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.

back to top

3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Ebstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.
- bwa index -a bwtsw human_g1k_v37_decoy.fasta

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_unthreaded_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and does not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

We invoke three options in the command.

-M to flag shorter split hits as secondary. This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.
-p to indicate the given file contains interleaved paired reads.
-t followed by a number for the number of processor threads to use concurrently. Here we use seven threads which is one less than the total threads available on my Mac laptap. Check your server or system's total number of threads with the following command provided by KateN.
- getconf _NPROCESSORS_ONLN

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.

After MarkIlluminaAdapters (step 2)

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

Does the aligned file contain read group information?

After BWA-MEM (step 3)

After MergeBamAlignment (step 3)

back to top

3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. Additional for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and it proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that both the uBAM and aligned BAM are both sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \ R=/path/Homo_sapiens_assembly19.fasta \ UNMAPPED_BAM=6383_snippet_revertsam.bam \ ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM O=6483_snippet_mergebamalignment.bam \ CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs ADD_MATE_CIGAR=true \ #default; adds MC tag CLIP_ADAPTERS=false \ #changed from default CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not extend past each other INCLUDE_SECONDARY_ALIGNMENTS=true \ #default MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z TMP_DIR=/path/shlee #optional to process large files

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

Setting PRIMARY_ALIGNMENT_STRATEGYto MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads irregardless of the number of insertions and deletions. The default is 1.
Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
ATTRIBUTES_TO_RETAIN is specified to carryover the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag in unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
Setting CLIP_ADAPTERS to false leaves reads unclipped.
By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.

Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the numberunmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.

After BWA-MEM alignment

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After MergeBamAlignment

back to top

3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask UNIX to stop the piped command if any step of the pipe should error and also return to us the error messages. Type the following into your shell to set these UNIX options.

set -o pipefail

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \ I=6483_snippet_markilluminaadapters.bam \ FASTQ=/dev/stdout \ CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \ TMP_DIR=/path/shlee | \ /path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \ java -Xmx16G -jar /path/picard.jar MergeBamAlignment \ ALIGNED_BAM=/dev/stdin \ UNMAPPED_BAM=6383_snippet_revertsam.bam \ OUTPUT=6483_snippet_piped.bam \ R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \ CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \ INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \ PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \ TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intensive purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process updates, that also remain constant for these instances, are the number of alignment records and the number of unmapped reads.

INFO 2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.

back to top

Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAM^XT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.

For workflows that nestle this pipeline, consider additionally optimizing java jar's parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ... java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.

back to top

Updated on 2017-01-19

From everestial007 on 2016-03-08

Quite a good and comprehensive workflow ! Just out of curiosity – if we follow this Broad production pipeline should we still cleanSam and fixMateinformation after aligment (BWA) or at any other stages.

From Geraldine_VdAuwera on 2016-03-08

You shouldn’t need to as MergeBamAlignments will do all that cleanup for you.

From everestial007 on 2016-03-08

Thanks @Geraldine_VdAuwera . :smile:

From jemorg on 2016-03-11

Hi, do you have an idea when tutorial 6570 will be online? Thanks

From Sheila on 2016-03-14

@jemorg

Hi,

Can you tell us the exact name of the tutorial? Perhaps that is an outdated/archived tutorial that has been replaced.

Thanks,

Sheila

From shlee on 2016-03-14

Hi @jemorg,

For now, please refer to Picard’s documentation of [IlluminaBasecallsToSam](http://broadinstitute.github.io/picard/command-line-overview.html#IlluminaBasecallsToSam). I can’t say when Tutorial#6570 will be written.

From EADG on 2016-03-17

@shlee Ty this is quite a nice tutorial!

At the moment I playing around with paired-end amplicon data. I want to clean up my aligned bam-file with MergeBamAligment. I see in the manual [broadinstitute.github.io/picard/command-line-overview.html#MergeBamAlignment](http://broadinstitute.github.io/picard/command-line-overview.html#MergeBamAlignment) that MergeBamAligment has an option to enter the first and the second read of pair as aligned bamfiles. But no option to to deliver the two bam-files with the unmapped reads.

Now my Question :)

Should I merge the two unmapped bam-file into one ? And use the merged bamfile as unmapped bam file ?

Or should i run MergeBamAligment twice, with the unmapped.bam for first and the second reads of pair ?

Ty!

Greetings EADG

From shlee on 2016-03-18

Hi @EADG ,

Merge the two unmapped BAMs (uBAMs) into one before running MergeBamAlignment. MergeBamAlignment looks for read pairs in the uBAM and will error if it cannot find the second read from a pair. To merge the two uBAMs, see [MergeSamFiles](http://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles). Be sure that MergeSamFiles does not modify your read group IDs and is set to queryname sort.

Let me know if you encounter problems.

Soo Hee

From EADG on 2016-03-22

@shlee Ty for your help! I now use the FastqToSam command to create the uBAM-File, with the F1 and F2 option. It seems to running fine…just waiting for my pipeline to finish the job.

From shlee on 2016-03-22

@EADG. Thanks for sharing your FastqToSam solution. Glad it works.

From daverdin on 2016-04-07

Hi @shlee

I’m following this tutorial but I’m stuck on the MergeBamAlignment. As you’ve said before, it looks for read pairs in the uBAM and will error if it doesn’t find the second read from a pair. When I try to use the MergeBamAlignment command, it doesn’t output any file. I’ve looked at different potential problem and more particularly at the previous comment.. but no luck. However when I tried FastqToSam to combine my reads I got an error message telling me that my paired reads do not have the same names. The paired reads are named “read1_1” and “read1_2” and do not have the /1 and /2 that I suppose should be present. Do you think that could be my problem? If so is there any command to change the way my reads are named?

thanks a lot!

Guillaume

From shlee on 2016-04-08

Hi @daverdin,

MergeBamAlignment identifies read pair mates by their identical read names. So I think you are correct to try to resolve what I assume are your different pair read names using FastqToSam. Within the FastqToSam code we find the passage:

> Paired reads must either have the exact same read names or they must contain at least one “/” and the First pair read name must end with “/1” and second pair read name ends with “/2”. The baseName (read name part before the /) must be the same for both read names.

So I believe your read names must use `/1` and `/2` instead of your `_1` and `_2` for FastqToSam to recognize them. Also, be sure your read names do not contain blanks. Finally, the [FastqToSam documentation ](http://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) shows that you need to set the `STRIP_UNPAIRED_MATE_NUMBER` argument to true to make sure the `/1` and `/2` do not carry over to the output file read names.

In terms of an easy way to change your reads, I found this [forum post from 2012](http://seqanswers.com/forums/showthread.php?t=18371) that gives a variety of potential solutions. I hope you find a solution that works.

From Geraldine_VdAuwera on 2016-04-08

For what it’s worth I think Picard should accept either convention. @daverdin, I would encourage you to put in a feature request in the Picard repository on github to make this a configurable option.

From daverdin on 2016-04-11

Hello shlee, Geraldine_VdAuwera

Thanks for your responses, I’ve layed with my data and found that the _1 and _2 were indeed the problem in my files..

I replaced the end of each read names (sed ‘s/_2$/\/2/’ Sample.2.fq > Sample_test2.fq) and was able to obtain the desired output. I just need to verify that my command didn’t remove anything and I should be good to continue.. I will also write a request on github.

thanks a lot for your help..

From medhat on 2016-04-30

Hi, I was following the [GATK Best Practice (howto) Map and mark duplicates](http://gatkforums.broadinstitute.org/gatk/discussion/2799#latest) But I did not catch this MergeBamAlignment part so here is what I did;

Begin from fastq file “was processed using Prinseq to filter and trim sequence” —> then was mapped to the reference using best practice work follow as suggested BWA mem —> then sorted , and used picard to Mark Duplicates.

But now I found that I did not do MergeBamAlignment as suggested by this tutorial, shall I rerun every thing or there is better option?!

Thanks in advance.

From shlee on 2016-05-02

Hi @medhat,

In the introduction to this tutorial we mention that there are various different ways of correctly pre-processing HTS data for variant discovery, of which this tutorial is only one. So depending on your study design and aims, and any alternative pre-processing that you’ve already done for your samples, e.g. adding in read groups and MC tags and converting hardclips to softclips, you may find this particular pre-processing workflow redundant. We provide this tutorial workflow as a guide to represent what the Broad Genomics Platform does in their pre-processing and to highlight particular considerations that were previously undocumented, i.e. MergeBamAlignment’s features.

If you can tell us more about your “filter and trim” step, which I do not believe is part of our best practice workflows, and your study design, then perhaps we can better answer your question. For example, GATK workflows take base quality into account so filtering reads and trimming ends isn’t part of our workflow as explained in [this discussion](http://gatkforums.broadinstitute.org/gatk/discussion/2957/read-trimming). If you’re new to our workflows, I would suggest sticking with what is outlined in our [Best Practice Guide](https://www.broadinstitute.org/gatk/guide/best-practices).

I hope this is helpful.

Soo Hee

From silks225 on 2016-05-10

Hi,

I was wondering if I could get some clarification. I have been following this tutorial and using the data snippets provided. I converted the .fq to uBAM using tutorial#6484 however I get an error when I MergeBamAlignment which having done some reading I think points to the bam not being sorted correctly?

I think my script is good as if I changed the unmapped bam I generate to the reverted BAM provided in the bundle it works.

Exception in thread “main” java.lang.IllegalStateException: Aligned record iterator (HWI-ST970:586:C2PY6ACXX:1:1115:10000:27368) is behind t

he unmapped reads (HWI-ST970:586:C2PY6ACXX:1:1115:10000:3435)

then a load of nullPointerExceptions

I just wondered if this was known and should I be sorting the output from BWA before merging the BAM files? I was using the piped command.

Many thanks

From medhat on 2016-05-10

Thanks A lot this was very helpful

From shlee on 2016-05-10

Hi @silks225. I’m looking into this and will get back to you hopefully within the week.

From silks225 on 2016-05-10

Thanks @shlee your advice would be much appreciated

From shlee on 2016-05-10

Hi @silks225,

First, I’ve updated [Option A of Tutorial#6484](http://gatkforums.broadinstitute.org/gatk/discussion/6484#optionA). You’ll notice now the command takes two FASTQ files (`FASTQ` and `FASTQ2`) and the [tutorial_6484_FastqToSam.tar.gz](https://drive.google.com/open?id=0BzI1CyccGsZiUXVNT3hsNldvUFk) download contains two FASTQ files that match the command. Please use these files and the updated tutorial instructions. It turns out FastqToSam is unable to process an interleaved paired FASTQ file. I apologize for this and thank you for bringing this to my attention.

Second, be sure to use the commands as shown as they preserve the default queryname sorting of the output files for Tutorial#6484 and Tutorial#6483. All of the commands take queryname sorted records. Only the MergeBamAlignment step produces coordinate sorted records.

Let me know if you encounter additional errors.

From silks225 on 2016-05-11

Great, thanks @shlee for the clarification that makes a lot of sense I will give it ago.

From fabiodp on 2016-05-31

Hi I was wondering why the final “clean bam” produced with MergeBamAlignment takes as unmapped bam the initial bam file and not the markilluminaadapters bam. It could be useful to preserve through any following steps the information about reads containing adapters.

Thanks,

By the way excellent “how to”, very useful, thanks!!!

From shlee on 2016-06-01

@fabiodp —Yes, certainly if you will use the XT tag downstream, by all means use the output of MarkIlluminaAdapters as the uBAM in the MergeBamAlignment step. Thank you for the complement.

From timk on 2016-07-10

Hi,

Thanks for the great tutorial. I have a question: I’m trying to reproduce your tutorial results with the provided data. I ran into the problem that my resulting merged bam includes reads that are marked with the “XT” tag. However, a diff with your files show that your marked_adapters.bam does include marked reads but your merged final result file “6483_snippet_piped.bam” does not. But shouldn’t there still be marked reads in the result file after merging with the bwa things? I thought the “CLIP_ADAPTERS=false” stops picard from hard clipping reads and not soft clipping?

Cheers,

Tim

From shlee on 2016-07-11

Hi Tim (@timk),

Yes the workflow in the tutorial uses the BAM from the RevertSam step as the unmapped BAM and so we do not expect to carry-over the XT tag. However, as I replied to @fabiodp, you can also use the BAM from the MarkIlluminaAdapters step as the unmapped BAM so as to carry-over the XT tag. You’ll have to add `ATTRIBUTES_TO_RETAIN=XT` to the MergeBamAlignment command. I agree this tag will be useful, e.g. in interpreting deviant reads arising from 3’ adapter sequence in a pileup.

You bring up a good point with `CLIP_ADAPTERS=false` option. This MergeBamAlignment option, if set to `true`, only soft-clips the adapter sequence defined by the XT tag. There is no option to hard-clip adapter sequence with the Picard tool. Because downstream steps in our pre-processing workflows, e.g. MarkDuplicates or HaplotypeCaller, still use soft-clipped bases, whether or not adapter sequences are soft-clipped, results are the same in that any adapter sequence is included in consideration. Remember the MergeBamAlignment step changes BWA hard-clips to soft-clips in the merging so all types of softclips are present in the data. Presumably, reads with additional adapter sequences will be minor and these reads will mostly be discounted from contributing to variant calls in that HaplotypeCaller requires support for paths in its haplotypes graph from at least two supporting reads.

As far as I know, if you want to hard-clip these adapter sequences, you will have to do this with an outside tool or with your own script and use this hard-clipped file as the unmapped BAM. The MarkIlluminaAdapter step allows you to detect the extent of adapter sequences so you can either (i) continue with the workflow knowing the extend of 3’ adapters is negligible for your dataset, (ii) go and clip moderate 3’ adapters to start the workflow anew with a trimmed dataset or (iii) decide to remake your library or modify your library prep protocol because the sequences derived from the extensive small insert sizes make for a less informative dataset.

I hope this answers your question.

From timk on 2016-07-13

Hi Shlee,

Sorry for reposting and thanks for the extended answer.

Cheers,

Tim

From MUHAMMADSOHAILRAZA on 2016-07-13

@shlee

Hi,

i tried to reproduce results with same command line, starting from same input tutorial file “6483_snippet.bam” and reference sequence. After RevertSam -> MarkilluminaAdapters, piping and/or by following the steps (FASTqtoSam->BWA-MEM->MergeBamAlignments) separately, for both instances the number of alignment records and the number of unmapped reads are same.

INFO 2016-07-13 17:51:49 AbstractAlignmentMerger Wrote 275774 alignment records and 1198 unmapped reads. However, these counts are different than above mentioned one. i am using bwa-0.7.12-r1039 and picards-2.2.4.

I guess this might be due to change in bwa software version, What you say?

could you please explain how you count the number of reads for RevertSam/MarkilluminaAdapters or in intermediate sam/bam(s) ? I wonder, would it be a software version difference or somewhere else..

Thanks!

From shlee on 2016-07-13

Hi again @MUHAMMADSOHAILRAZA,

Yes, the tutorial uses BWA v 0.7.7.r441, which the Broad Genomics Platform has used in their production workflows. This is a different version of BWA than yours. It’s possible this difference is the source of the different number of alignment records, but I cannot say for sure.

It’s unclear to me what you mean by “for both instances the number of alignment records and the number of unmapped reads are same” given RevertSam/MarkIlluminaAdapters doesn’t have aligned records.

You can count number of records with `samtools view -c`. I hope you satisfy your curiosity and are able to answer your own questions.

From MUHAMMADSOHAILRAZA on 2016-07-14

@shlee

Hi,

i mean for “both instances” are the results that came out with “PIPING” and running the scripts altogether (instance 1) and result that came out by running the scripts (SamtoFastq, Bwa-mem, MergeBamAlignments) separately (instance 2).

For these both instances, the number of mapped and unmapped reads were matched (but differs with your results) , i.e. INFO 2016-07-13 17:51:49 AbstractAlignmentMerger Wrote 275774 alignment records and 1198 unmapped reads

On the other hand, i checked the read counts for my own generated RevertSam and MarkIlluminaAdapters files, they were the same as mentioned above (275546 all records including XT, and 448 records that only marked with XT). bwa-mem read-mapping is the only difference that leads to different number of alignment records. If this is the case then i think using recent bwa version is more beneficial.

From MUHAMMADSOHAILRAZA on 2016-07-14

@shlee

Hi,

Additionally, i am also curious about one coordinate-sorted file provided in tutorial’s bundle (“/tutorial_6483_intermediate_files/6483_snippet_bwa_mem_coordinate.bam”):

As we have both “6483_snippet_revertsam.bam” and “6483_snippet_bwa_mem.sam” files are sorted by ‘queryname’ that would be sufficient for running MergeBamAlignments, So i wonder at which step and with what purpose this file (“/tutorial_6483_intermediate_files/6483_snippet_bwa_mem_coordinate.bam”) was generated ?

Thanks!

From shlee on 2016-07-14

@MUHAMMADSOHAILRAZA That’s great you found the reason for the difference is the BWA version. And yes, if you have a choice per experiment set, it is always good to use the latest tool version that offers the features you need.

I included the coordinate sorted alignment file for those interested in dissecting differences before and after MergeBamAlignment, e.g. using IGV to recapitulate the view shown in the IGV screenshot I include in the tutorial. I do not document how to generate this coordinate-sorted file, but rather I just included the file. If you need to know, you can generate the same file with a Picard SortSam conversion.

From MUHAMMADSOHAILRAZA on 2016-07-17

@shlee

Hi again,

I downloaded aligned BAM files per sample for already published genomes of human populations. In Header section i saw RG record something like this:

@RG ID:LP6005441-DNA_A09 SM:LP6005441-DNA_A09

having information for only RGID and RGSM. But to reproduce the results with GATK best practices, I think i have to replace the read group (RG) information.

1. At what point in your opinion it’s best to replace RG information, before starting the whole process cleanup “unmapping the aligned BAMs” or “after following cleaning BAMs after MergeBamAlignments” ?

Secondly, by checking the reads information in a BAM file, it appears that the data sequenced on multiple lanes, i.e.:

HS2000-630_102:4:2115:1889:70619 HS2000-630_102:3:2311:13151:38215 HS2000-630_102:2:2315:18670:41735

2. Is it okay to assign the same RGIDs (As i mentioned earlier that authors did!) for the reads run on multiple lanes? could you please suggest any way to clean up this type of BAMs to deal with such situation, so i can Markduplicates and run BQSR procedures correctly.

Thanks!

From shlee on 2016-07-18

Hi Muhammad ( @MUHAMMADSOHAILRAZA ),

As you surmise, our tools require more information for the read group field than what is present in your downloaded BAM files. These are specified in the Dictionary entry #6472 on read groups [here](http://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups). Your unaligned BAM, the one you use in MergeBamAlignment, should have the correct read group information and read group tags for each read. Otherwise, you won’t be able to proceed with using our tools. You can use [AddOrReplaceReadGroups](https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups) to accomplish this.

Ideally, you should represent each lane with a unique read group ID and appropriate Library and Sample identifiers, e.g. the same for these if they are indeed the same library prep and sample. Pooling different lanes of the same sample with the same RGID is NOT good practice. I could go on but I think your line of questioning shows you already understand the implications and are only looking for some confirmation. You’ll find the same answer across our forum if you search using your question.

From MUHAMMADSOHAILRAZA on 2016-07-19

@shlee

Hi, Is there any easy way to split the merged BAM to multiple BAMs with respect to Flowcell lanes?

Thanks!

From MUHAMMADSOHAILRAZA on 2016-07-19

i cannot access my profile properly or edit my comments and also cannot post a new question, please check my profile as well.

Thanks!

From shlee on 2016-07-19

@MUHAMMADSOHAILRAZA, Our tools are read group aware and so they will process your multi-readgroup merged sample level BAM appropriately. If there is some reason you still need to subset reads, then I am aware of two tools that can do this. First is Picard’s [RevertSam](https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam); use the `OUTPUT_BY_READGROUP` option. Second is GATK’s [PrintReads](https://www.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php); you can exclude/include reads by `—platform`, `—readGroup` or `—sample_file` or `—sample_name`.

I’ll get back to you about your issues accessing your account. If you can do your bit and try again later and let us know if you’re still having issues, we can rule out some possibilities.

From MUHAMMADSOHAILRAZA on 2016-07-20

@shlee

Hi,

Thanks for response! i am aware of these options and tools that you mentioned above, but i was looking for something that can edit read group information (especially IDs) if we have same RG tag for multiple lanes merged in single BAM something like this:

@RG ID:LP6005441-DNA_A09 SM:LP6005441-DNA_A09

So i thought the possible solution might be:

1. Split the BAM file into multiple BAMs per lane > then Assign correctly or replace read groups

2. Or just to replace the read groups in BAM by just reading the lane information (directly in one run).

Thanks!

From Geraldine_VdAuwera on 2016-07-20

I believe Picard AddOrReplaceReadGroup requires you to separate out the reads into single bams per read group.

From MUHAMMADSOHAILRAZA on 2016-07-28

@shlee

Hi again,

In case of large BAM files (i.e. per sample per bAM) approximately 100Gb, are the following additional optimizing java jar’s parameters are sufficient while piping the scripts together at step 3 (above)?

java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq … java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment …

I don’t know so much about these parameters,

-Xmx5000m, -Dsamjdk.buffer_size, -XX:GCTimeLimit, -XX:GCHeapFreeLimit, -Dsamjdk.use_async_io=true, -XX:+UseStringCache -

i try to learn about them from internet but not clear about how to set these parameters for large data sets. I know this is not GATK related question, but kindly could you please explain a bit in detail how to use them with one-line comprehensive definition, as you guys routinely use them on large HTS data sets at Broad institute.

Thank you very much!

From Geraldine_VdAuwera on 2016-07-28

Hi @MUHAMMADSOHAILRAZA,

That question is outside the scope of support we provide. We are here to help people understand how GATK tools work and how to apply them correctly to answer research questions. We are not here to act as general scientific computing consultants.

Please understand that we have a lot of people asking for our help, so we need to prioritize our efforts. You have been asking an unusually large number of questions, so now it is time for you to figure things out for yourself. For the coming weeks, I expect you to refrain from posting questions unless it is something directly related to GATK tools, you have really tried to find the answer in the documentation, and it is not something you can find out by doing some test runs.

From MUHAMMADSOHAILRAZA on 2016-07-28

@Geraldine_VdAuwera

Hi,

I just asked a question about the line of java jar’s arguments that are suggested above in this documentation page. I know these questions not directly linked with GATK’s tools but relevant with smooth running of Picards tool that is going to integrate with GATK (as far i know) in future.

Secondly, whenevr i post these question related to java parameters on internet, the response always comes in return “it depends the tools that you are using with underlying data set” so actually it was not merely as a general scientific consultation.

as i am new in this so thank you very much for your kind patience and responding towards my all queries even when they are not worth answering might be . I know the GATK team is busy. i never posted a question of which answer i can get anywhere else…

Anyway i will take care next time.

Thanks!

From Kyuwon on 2016-09-02

@shlee

Your method is excellent, but the error message at MergeBamAlignment was shown and have not solved up to now.

I tried to solve this using applying FixMateInformation, SortSam, CleanSam independently and together.

Not all the samples make this error.

I would very much appreciate If you help me to solve the problem.

The error seems about “Requesting earlier reference sequence”

java -Xmx10g -Djava.io.tmpdir=tmp1 -jar picard.jar MerageBamAlignment R=reference.fa UNMAPPED_BAM=sample.ubam ALIGNED_BAM=sample.sam O=sample.bam CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS VALIDATION_STRINGENCY=SILENT TMP_DIR=tmp2

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Exception in thread “main” htsjdk.samtools.SAMException: Requesting earlier reference sequence: 0 < 5 at htsjdk.samtools.reference.ReferenceSequenceFileWalker.get(ReferenceSequenceFileWalker.java:82) at picard.sam.AbstractAlignmentMerger.calculateMdAndNmTags(AbstractAlignmentMerger.java:596) at picard.sam.AbstractAlignmentMerger.clipForOverlappingReads(AbstractAlignmentMerger.java:622) at picard.sam.AbstractAlignmentMerger.transferAlignmentInfoToPairedRead(AbstractAlignmentMerger.java:580) at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:390) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:157) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:266) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

From Sheila on 2016-09-02

@Kyuwon

Hi,

Can you try re-sorting your BAM file? Please make sure you re-generate the index file too.

Thanks,

Sheila

From abhishek_maj08 on 2017-01-17

Hi,

I have a quick question regarding this tutorial. I apologize in advance if it is a trivial/obvious question. In section 3C, you mention that the inputs to MergeBamAlignment, uBAM and alignedBAM should be sorted. The alignedBAM is sorted when it is created, so I don’t need to worry about that. But the uBAM, do I have to sort it separately before-hand. I mean, I want to use the MarkIlluminaAdapters. So do I have to sort my original uBAM file before feeding it as an input to MarkIlluminaAdapters. Or does MarkIlluminaAdapters in addition to adding tags, sorts the output as well. The reason I ask this question is, I plan to use the piped command directly, with the output of MarkIlluminaAdapters as the input. So do I need to worry about sorting the original uBAM before the pipeline starts?

Thanks

From abhishek_maj08 on 2017-01-18

@slee

Sorry for re-posting this. I have a quick question regarding this tutorial. I apologize in advance if it is a trivial/obvious question. In section 3C, you mention that the inputs to MergeBamAlignment, uBAM and alignedBAM should be sorted. The alignedBAM is sorted when it is created, so I don’t need to worry about that. But the uBAM, do I have to sort it separately before-hand. I mean, I want to use the MarkIlluminaAdapters. So do I have to sort my original uBAM file before feeding it as an input to MarkIlluminaAdapters. Or does MarkIlluminaAdapters in addition to adding tags, sorts the output as well. The reason I ask this question is, I plan to use the piped command directly, with the output of MarkIlluminaAdapters as the input. So do I need to worry about sorting the original uBAM before the pipeline starts?

Thanks

From slee on 2017-01-18

Hi abhishek_maj08, I think you meant to tag shlee!

From shlee on 2017-01-19

Thanks for that @slee.

@abhishek_maj08, let me answer your questions as best I can.

In section 3C, the sorting required is query-group-sorted, which can be a slightly different order from query-name-sorted, but can be used interchangeably for most contexts. This is the order that paired-end (PE) reads are in when they are interleaved in section 1. This is also the ordering that MarkIlluminaAdapters expects and maintains in its output. BWA outputs query-group-sorted reads. For PE data, you want to be certain you provide a query-name/group-sorted file to BWA so as not to bias calculations of mean insert size that can arise from providing files containing vestiges of previous alignments, i.e. sort orders based on previous alignment coordinates.

So through all the steps of the workflow, you do not perform any active sorting. Again, no need to worry about sorting.

From shlee on 2017-01-19

Updates and additions to the tutorial

See this github doc I wrote that uses a newer release of Picard and this Picard codebase issue for additional descriptions of MergeBamAlignment features, including a new tag, the FT tag, that the tool adds to unmapped contaminant reads when UNMAP_CONTAMINANT_READS is set to true.
You can now also download the tutorial example data via FTP from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/.
Newer versions of MergeBamAlignment going forward will now copy over AH tags from @SQ header lines if present, and allows the user to specify which header to use in the merged output--that from the uBAM or that from the aligned BAM. See Tutorial#8017 and this Picard codebase issue for a description of the AH tag.

From abhishek_maj08 on 2017-01-26

@shlee

Thank you for your response. I should have mentioned that I start off directly with uBAMs. So basically from Step 2. That is why I did sort them before using them as input to MarkIlluminaAdapters. Is that okay?

From shlee on 2017-01-26

@abhishek_maj08,

Did you queryname-sort? Because that is the sort you should be sure to do if you are sorting at this stage from a sort order that used to be coordinate. Besides the reasons I mention above, this also ensures the pairs are interleaved, i.e. next to each other.

From abhishek_maj08 on 2017-01-26

@shlee Yes, I sorted by queryname.

From elcinchu27 on 2017-02-21

I have written this pipeline in firecloud but I have a problem with the fastq files, because the first fastq has reads of 76 bases while the second one has 39 bases. Could you explain what is the problem with the outputs?

001ND_1.fastq:

HWUSI-EAS1691_0002:2:9:9999:9592#0/1 GTGAGAGAAGTCTATGAAAGGGCCATTGCCAATGTCCCACCCATTCAGGAGAAGAGGCACTGGAAGCGCTACATTT + CCC-CCDBDDFDEFFFEFDFFFFFCEE?EEFFFBEFFFFFECCDBEBE?BBFEEF?DF=DBBBBBDDEA

?A@@=B

001ND_2.fastq:

HWUSI-EAS1691_0002:2:9:9999:9592#0/2 TAGGTGAGCTGTGAATACAGAAGACAGGAAACCATTATC + DDDD

DBDDDFDFEFEEFEDEE?DDFFA=DFFFDFEEDE

Thank you for your help,

BamToFastq_Firecloud.txt

From shlee on 2017-02-21

Hi @elcinchu27,

When you say problem, what do you mean, e.g. do you get an error message or are the lengths of reads that you input different from the resulting reads that you are showing?

From elcinchu27 on 2017-02-21

The problem is that the lengths of reads that I input differe from the resulting reads that I have in the fastq. All the reads should be of 76 bases, but in the 001ND_2.fastq they only have 39.

From shlee on 2017-02-21

Can you confirm in the aligned BAM that truncated reads are not truncated. E.g. can you post here the result of grepping out `@HWUSI-EAS1691_0002:2:9:9999:9592` from the original aligned BAM?

Otherwise, examining your commands I don’t see anything that should truncate reads. Your SamToFastq clipping action is to replace the base qualities to 2. Are you using the [latest Picard release](https://github.com/broadinstitute/picard/releases)? Looks like it’s now at v2.9.0.

From elcinchu27 on 2017-02-21

You are right, the original bam is truncated too and I do not know why. It is my first experience with this pipeline and I thought that there was something wrong with my script. I have never seen a truncated read in a bam before so sorry for that.

Regarding the Picard version, I was using the 2.5 version because of the docker file that I have, but I could change it or the last version without any problem. Do I need to introduce important variations in the Picard 2.9 commands?

Another issue that I would like you to ask is about the “VALIDATION_STRINGENCY=LENIENT” option. What I want with this script is the realignment of all the reads in my bam file, because I need to use a different BWA algorithm. Am I losing the not mapped reads with a MQ different of 0?

Thank you so much. I really appreciate your help,

From shlee on 2017-02-21

If someone handed you a script and told you to use a particular Docker image, then the versions of the tool and the tool options that the script invoke likely go hand-in-hand. That being said, Picard tool options change infrequently. Updates typically provide you with more options and also fix bugs. You can read the Picard release notes [here](https://github.com/broadinstitute/picard/releases). If you update, you should test to make sure your commands still work as expected and the resulting data is as expected.

The `VALIDATION_STRINGENCY` parameter is a [standard option](https://broadinstitute.github.io/picard/command-line-overview.html#StandardOptions). What happens when you don’t set this to LENIENT? The results should tell you something about your data.

To revert align reads for realignment, see [Tutorial#6484](http://gatkforums.broadinstitute.org/discussion/6484/#latest#top).

From elcinchu27 on 2017-02-21

When I use the LENIENT value I do not have any problem but when I use the STRICT default value in VALIDATION_STRINGENCY, I have this error:

[Mon Feb 13 22:28:58 UTC 2017] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 36.17 minutes.

Runtime.totalMemory()=28442624

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Exception in thread “main” htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 100838398, Read name HWUSI-EAS1691_0002:1:16:4292:16941#0, MAPQ should be 0 for unmapped read. at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441) at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:665) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519) at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:140) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

When I check if those specific reads are lost or not in the final FASTQ file, they are in the output file with the LENIENT option, so I supposed that there is no problem with them.

From Geraldine_VdAuwera on 2017-02-22

If you’re going to be remapping them anyway I wouldn’t worry about that, lenient processing should be fine.

By the way, we deposited some WDLs for reverting BAM files in the WDL repository (look in the scripts/broad_dsde_workflows directory) which may come in handy for your project, unless you’re already all set. We’re also going to make clonable workspaces in FireCloud preloaded with related methods in the coming months, but I assume that will be too late for you.

From agr_lab_nbrc on 2017-02-24

Picked up JAVAOPTIONS: -Djava.io.tmpdir=/galaxy-repl/main/jobdir/015/129/15129530/galaxytmp -Xmx15872m -Xms256m [Fri Feb 24 10:12:32 CST 2017] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/galaxy-repl/main/jobdir/015/129/15129530/galaxytmp/tmp-gatk-aGAWJD/gatkinput.fasta OUTPUT=/galaxy-repl/main/jobdir/015/129/15129530/galaxytmp/tmp-gatk-aGAWJD/dict7762336907034131389.tmp TRUNCATENAMESATWHITESPACE=true NUMSEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATIONSTRINGENCY=STRICT COMPRESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false [Fri Feb 24 10:12:59 CST 2017] Executing as g2main@roundup61.tacc.utexas.edu on Linux 2.6.32-504.8.1.el6.x8664 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_40-b43; Picard version: 1.58(1057) [Fri Feb 24 10:13:03 CST 2017] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.53 minutes. Runtime.totalMemory()=754450432

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(Unknown Source) at java.util.ArrayList.get(Unknown Source) at net.sf.samtools.Cigar.getCigarElement(Cigar.java:54) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:441) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:392) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:246) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:184) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:276) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:233) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:122) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:90)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version exported):

ERROR

ERROR Please visit the wiki to see if this is a known problem

ERROR If not, please post the error, with stack trace, to the GATK forum

ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki

ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa

ERROR

ERROR MESSAGE: Index: 0, Size: 0

ERROR ------------------------------------------------------------------------------------------

the above error i face when i run the workflow on galaxy where forward fastq and reverse fastq after paired end sequencing is the input files for bwa and then bam file is generated - this error occurs in Unified Genotyper for variant calling

Please Help!

From shlee on 2017-02-24

Hi @agr_lab_nbrc,

It looks like you’re running a Galaxy pipeline with multiple tools chained together. To figure out what is causing the error, I recommend you run your commands one at a time using the latest commandline version of the tools. Installation instructions for Picard and GATK tools are at . Also, we recommend you use HaplotypeCaller for variant calling, not UnifiedGenotyper. Take a look at our Best Practice recommendations at .

From alphahmed on 2017-05-15

Thank you for the clear and thorough explanation of the tool.

I’ve been trying to use the suggested java parameters to work on my large sam files. However, each time trying to run the script, I get an error related to memory “requested array size exceeds VM limit.” I believe it is unavoidable due to the index string management design of java.

I do understand that your team’s valued time is for focusing on the GATK tools rather than computer-science questions, however, I find this step very important and it seems that the “-XX:+UseStringCache” option has been removed in 1.8 java, which is required for the latest Picard tool.

Any suggestions for seeking an alternative way to clean up a large sam file?

I’d appreciate any kind reply!

Ahmed

From shlee on 2017-05-15

Hi @alphahmed,

The “requested array size exceeds VM limit” error sounds like a Java error. Try increasing your Java memory with the `-Xmx` parameter. To find your system’s available Java memory:

> To find a system’s default maximum heap size, type `java -XX:+PrintFlagsFinal -version`, and look for `MaxHeapSize`. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.

You should be able to omit the `-XX:+UseStringCache` option from your command and have the tool run. I’m not familiar with all the parameters that we use in pipelining that I include at the very end of the tutorial. I do not use these parameters myself.

Alternatively, to avoid having to plumb all these options, check out our cloud pipeline, which aligns to GRCh38 and preprocesses alignments in Google Cloud. You can find the public version of the WDL script that details the different tool options at . The Docker image containing all the tools that the script uses is publicly available for anyone to use. I explain an older version of the WDL script in [Article#7899](http://gatkforums.broadinstitute.org/gatk/discussion/7899). You can see that MergeBamAlignment is a part of the workflow and you could use the given parameters as a starting point to process your large BAMs in the cloud.

Thanks for your compliment! I hope this is helpful.

From jfiksel on 2017-06-27

Thanks for this great article! I was just wondering in 3C and 3D you start using

`R=/path/Homo_sapiens_assembly19.fasta`

rather than

`R=/path/human_g1k_v37_decoy.fasta`

Will these give the same output? Or is `Homo_sapiens_assembly19.fasta` a different download?

From shlee on 2017-06-28

Hi @jfiksel,

That’s a great catch. Yes, they are the same reference when concerning the primary assembly. I’ve come to learn more about reference genomes since writing this tutorial as a newb in the field. The former, `Homo_sapiens_assembly19.fasta` is named unconventionally in that both references use the GRCh37 naming convention that omits the `chr` from contig names. For the work in this tutorial, I was switching between the reference available to me on the server and a reference downloaded from the GATK bundle. There is one difference—the former lacks the `hs37d5` decoy contig. The tutorial aligns to the reference with the decoy and then uses the reference without the decoy at the MergeBamAlignment step. I think a potential consequence of this (unchecked hypothesis) is that the decoy contig alignments revert to unmapped reads.

From freuv on 2017-09-19

@shlee Thank you for this tutorial — it is extremely thorough!

From egarmo on 2017-09-28

Hi, your getconf_NPROCESSORS_ONLN appears to be intended within the reach of Broad servers, or specific utility for a Mac Laptop command line. In order to make an educated guess for the setting of bwa mem -t I fell back on Linux $lscpu and interpreted output according to … https://askubuntu.com/questions/668538/cores-vs-threads-how-many-threads-should-i-run-on-this-machine

From shlee on 2017-10-03

Yes @egarmo, those commands are given as examples for further considerations for those running pipelines. Thanks for the additional information and link in setting threads.

From abefola on 2017-10-04

Hi Shlee,

Thank you for providing a nice and clear protocol.

I am doing parallel amplicon sequence of short reads (150-300bp) using Miseq Illumina platform. I got fastq paird end data for all my samples. So can I use this guideline to map my short amplicons with my reference ?

Best Regards,

Abe

From Sheila on 2017-10-05

@abefola

Hi Abe,

Yes, you can use this for amplicon data :smile:

-Sheila

From pschnepp on 2017-10-09

Hi, I’ve been having some trouble with the MergeBamAlignment step. I’ve got both the unmapped and aligned bam to be queryname sorted. But when I go to merge them, I get the following error:

Exception in thread “main” net.sf.picard.PicardException: Unequal number of first and second of pair hits for read: ERR1211176.10000190 at net.sf.picard.sam.MultiHitAlignedReadIterator.next(MultiHitAlignedReadIterator.java:128) at net.sf.picard.sam.AbstractAlignmentMerger.nextAligned(AbstractAlignmentMerger.java:348) at net.sf.picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:276) at net.sf.picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:154) at net.sf.picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:172) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MergeBamAlignment.main(MergeBamAlignment.java:142)

I’ve run ValidateSamFile and fixed most of the errors there but still get ERROR:MATES_ARE_SAME_END 40347

ERROR:MATE_NOT_FOUND 70809

for the mapped alignment (fixed_name_sorted.bam). Any help in solving this would be great.

From shlee on 2017-10-09

Hi @pschnepp,

The error:

> Exception in thread “main” net.sf.picard.PicardException: Unequal number of first and second of pair hits for read: ERR1211176.10000190

suggests something unexpected is going on that the SAM flag values can tell us about. Can you `samtools view xyz.bam | grep ‘ERR1211176.10000190’` and post here in this same thread all of the records that the grep returns? Also, can you post the MergeBamAlignment command and tell us briefly about how you processed your alignments up to the MergeBamAlignment step?

Run the 2nd column SAM flag values through and see if all of the flag values make sense for the read name set.

From garrulus_glandarius on 2017-10-18

I’m trying to implement the offered way of processing files into my workflow but I don’t fully understand the second step that adds XT tags. If I understand correctly from what

shlee said: > Yes the workflow in the tutorial uses the BAM from the RevertSam step as the unmapped BAM and so we do not expect to carry-over the XT tag. However, as I replied to

fabiodp, you can also use the BAM from the MarkIlluminaAdapters step as the unmapped BAM so as to carry-over the XT tag.

uBAM and uBAM with XT tags could be used interchangeably? Could you please have a look at my probable workflow of obtaining clean BAM and tell me if I understood the recommendations from this tutorial correctly or not? Of note, my data is obtained through amplicon sequencing. ![](https://us.v-cdn.net/5019796/uploads/editor/3x/pn247f6llvwy.png “”)

From garrulus_glandarius on 2017-10-18

Now I see another option: could I use RevertSam on primer-clipped BAM and then MergeBamAlignment using the primer-clipped BAM and uBAM originated from it as input?

From Sheila on 2017-10-23

@garrulus_glandarius

Hi,

Let me confirm my answer with the team and get back to you.

-Sheila

From SkyWarrior on 2017-10-25

Could primer clipping be done after merge bam file step ?

That way your bam file may have lesser issues downwards.

From garrulus_glandarius on 2017-10-26

@SkyWarrior

Well, I guess, it wouldn’t help.

MergeBamAlignment is intended to clean the data and fix issues, so it’s supposed to be done after the BAM processing steps, I guess. I’ve [tried ](https://www.biostars.org/p/279503/ “tried “)soft-clipping with Katana and BamClipper, and each of them introduced errors. The question is how to fix them…

From Sheila on 2017-10-31

@garrulus_glandarius

Hi,

Here are some tips that should help you with all of your other issues (we hope!).

1) All primer clipping should be done before alignment otherwise you will get errors. It seems like you are doing some clipping after alignment, which is causing the issue. Can you try clipping everything before alignment?

2) Are you doing 5’ or 3’ end clipping? The XT tag is used for 3’ end clipping. I think that can be done after alignment, but the 5’ end clipping needs to be done before alignment. Our tools cannot handle clipping of PCR primers after alignment. Also, if you are clipping at the 5’ end, you need to hard clip those.

-Sheila

From garrulus_glandarius on 2017-11-01

@Sheila

Hi,

Thanks for the clarification.

1) So there are 2 options: to soft-clip FASTQ, which is not advisable, or to make an uBAM out of FASTQ, soft-clip it and run alignment? Whould it be sensible?

2) I guess, Katana and BamClipper clip reads at the 5’-end. Why do we need hard-clipping at the 5’-end?

From SkyWarrior on 2017-11-01

You cannot softclip uBAM. There is no alignment info in there.

Most sensible option based on the article you showed us

1- Do all alignment and refinement steps and obtain your clean bam.

2a – Do the final clipping on the clean bam that you have and continue

2b – exclude primer sites from variant calling and get your variants free from Primer artifacts.

EDIT: Exactly my thoughts from BAMClipper samples provided by the authors

![](https://us.v-cdn.net/5019796/uploads/editor/mz/argt12gqcqrz.png “”)

From shlee on 2017-11-03

Hi @garrulus_glandarius,

Sheila asked that I followup on the answer I gave before as it seems it was unclear to you. Apologies. This is a subject matter I haven’t thought on in a while, so it’s possible some of my enumerations are incorrect.

To be clear, we are talking about 5’ clipping here.

When clipping unaligned reads, there can only be hard-clipping. I am not aware of GATK or Picard tools that do hard-clipping except for [BamToBfq](http://broadinstitute.github.io/picard/command-line-overview.html#BamToBfq), which does not appear to apply to your case. I’m not familiar with any of the trimming tools you mentioned. However, I am aware of an outside tool called [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) that trims FASTQ sequences. Perhaps it would interest you. Alternatively, perhaps your sequencing center’s software that converts sequencing imagery to FASTQ/BAM has an option for trimming a certain number of bases from the start of reads.

Let’s consider the impact of hard or soft clipping aligned reads.

First some background. Our workflows assume all adapter sequences have been trimmed (hard-clipped) before pre-processing. Alignment by BWA MEM introduces both soft and hard clips that MergeBamAlignment then amends by switching hard to soft clips and reinstating hard-clipped bases. HaplotypeCaller, when performing graph-assembly on a locus, throws out the aligner-issued CIGAR (containing the soft-clip information) and reissues its own CIGAR based on reassembly. Remember that HaplotypeCaller uses not only alignment information but base quality scores in calling variants. I will come back to this.

As far as I’m aware, our pre-processing workflows are incompatible with hard-clipping post-alignment as this produces invalid records according to ValidateSamFile. We do not have tools towards fixing the types of inconsistencies that arise from post-alignment hard-clipping. However, if you could fix all the inconsistent information, e.g. inconsistent read length and CIGAR string and mate cigar (MC) info, and are comfortable with the now potentially inflated or deflated aligner-assigned scores (MAPQ, AS) etc, then by all means hard-clip away.

As for soft-clipping post-alignment, because HaplotypeCaller (I assume you are calling with HaplotypeCaller as we’ve deprecated UnifiedGenotyper) throws away CIGAR strings, any soft-clipped information becomes fair-game in HaplotypeCaller reassembly. These sequences you wish to discount from calling are undesirably back in the game.

One thought I have is that you could assign a special and low base quality score (of 2 or `#`) to the bases corresponding to the 5’ end sequences you wish to discount. This can be done to the FASTQ or unaligned BAM. We have one tool that does this for 3’ ends of reads. Specifically, [SamToFastq](http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq) will trim based on a given tag, e.g.`CLIPPING_ATTRIBUTE`=XT and perform a variety of clipping actions including changing base quality, e.g. `CLIPPING_ACTION`=2. The impact of this is that BQSR will ignore such bases (see question by jhess and the subsequent answer in the comments section of ). Also, these bases will effectively not factor towards HaplotypeCaller variant calls.

If this last solution is of interest to you, either you could find or write your own tool to add a special tag (see [SAM specifications](https://samtools.github.io/hts-specs/) for which letter tags are available to you) and to BQ2-clip at the 5’ end, or you (or I if you ask of me) could put in a feature request to the [Picard github repository](https://github.com/broadinstitute/picard) to incorporate such an option into SamToFastq.

Well, this has gotten rather long. I hope I’ve been helpful.

P.S. HaplotypeCaller is actually a tool that will process certain types of invalid BAMs. I don’t know all the specifics but you could certainly try out your clipped reads to see if the tool runs on them.

From mattwell on 2018-04-19

Hi,

Is there an equivalent tutorial for the GATK 4 user guide? Or can I use the pre-processing protocols listed here as a template to generate analysis ready BAM for GATK 4 haplotypeCaller?

Thanks

From SkyWarrior on 2018-04-20

This tutorial is about the common steps before GATK action. You can use them with no issues.

From FPBarthel on 2018-06-01

Hi @shlee, thank you for this “beast of a tutorial”! Very helpful for designing my workflow. Since I am running into some issues with a small (but plural) number of samples (briefly described [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/49233/#Comment_49233 “here”)) I am trying to understand if there is anything going wrong here. Since my tests are taking a while to run I am going over this tutorial again and I noticed you posted the following comment.

> @shlee said:

> 1. See [this github doc I wrote](https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.md#MergeBamAlignment) that uses a newer release of Picard and [this Picard codebase issue](https://github.com/broadinstitute/picard/commit/a325e3356483f02dc066db713c6f0b098f6a3b4b) for additional descriptions of MergeBamAlignment features…

Unfortunately, the link to the updated tutorial and codebase above are broken. Can you provide new links?

EDIT: I noticed that the pre-processing WDL [here](https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl “here”) no-longer has the three-step pipe `samtofastq | bwa mem | mergebamalignment` but splits this into two steps 1) `samtofastq | bwa mem` followed by `mergebamalignment`. What was the reason for taking these steps apart?

From Sheila on 2018-06-11

@FPBarthel

Hi,

I will have Soo Hee get back to you.

-Sheila

From shlee on 2018-06-13

Hi @FPBarthel,

Thanks for pointing out the broken link. Looks like some folks reorganized that section of the repo and the linked document was deleted in the process.

The broken link is `https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.md#MergeBamAlignment`. It looks like a colleague of mine has instead provided an updated document in its stead at `https://github.com/broadinstitute/wdl/blob/master/scripts/broad_pipelines/germline-short-variant-discovery/gvcf-generation-per-sample/0.2.0/PublicPairedSingleSampleWf_170412.md`.

You mention that this workflow pipes three processes:

> > `samtofastq | bwa mem | mergebamalignment` but splits this into two steps 1) `samtofastq | bwa mem` followed by `mergebamalignment`

It does not. And skimming the `PublicPairedSingleSampleWf_170412.md` document I see what might be tripping you up:

—-

![](https://us.v-cdn.net/5019796/uploads/editor/b6/rhortk9dggzd.png “”)

—-

Looks like the updated GRCh38 article is erroneously pointing to the GRCh37 workflow ([Article#6483](https://gatkforums.broadinstitute.org/gatk/discussion/6483/reference-implementation-pairedendsinglesamplewf-pipeline)) where the three processes you mention were indeed piped. I will let my colleague know that this should be updated to point instead to the GRCh38 workflow.

The original GRCh38 document I researched and wrote back in May/June of 2016 is recapitulated in its entirety in Article#7899 at https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline. However, as noted in other forum posts, some minor elements of this workflow became outdated, e.g. replacement of `SetNmAndUqTags` with `SetNmMdAndUqTags` and use of MarkDuplicates on queryname sorted reads instead of coordinate-sorted reads, soon after it was written. Unfortunately, this particular document does not allow for a comments section (this is intentional) for researchers like yourself to track updates. So for updates, you can see at top I point researchers to the wdl GitHub repo. If you think it would be helpful for you to have an updated Article#7899, then let me know. I tend to get focused on the writing at hand and just assumed folks would find the WDL repo scripts the most useful.

Again, if you peruse [Article#7899](https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline#SortAndFixTags), you will see that neither the workflow in question nor the gatk4-workflows workflow pipes MergeBamAlignment. Only SamToFastq to BWA is piped in both. In particular, see the [SamToFastqAndBwaMem task](https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline#SamToFastqAndBwaMem) and the [MergeBamAlignment task](https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline#MergeBamAlignment). They are clearly separate for the GRCh38 preprocessing workflow.

Thank you for the compliment. I am learning much by delving into the topics I write about.

From Weichi on 2018-06-21

It’s my first time using MergeBamAlignment to generate BAM file . I don’t know why it looks strange in IGV.

![](https://us.v-cdn.net/5019796/uploads/editor/9c/kywa701ndrob.jpg “”)

And I use ‘cat’ and ‘grep’ to see what happened inside the file, I found that the sequence in Merged sam file is different with aligned sam file.

uBAM:

K00236:115:HGLKJBBXX:3:1206:11434:31470 4 * 0 0 * * 0 0 CATGTGTGCAATTAGCGTGATGAGCTCTGACATGGCCTTGCATGGACGGATTGGGCAGGACACCCCAGCTGAGGAGGATGGCAGGAGTGATGGCACAGGGGAAAGGGTGGCATACCCAGGTGACAGCTCCCCACTACCTCCACTCTGTGCTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:A

K00236:115:HGLKJBBXX:3:1206:11434:31470 4 * 0 0 * * 0 0 GTACAGGCAAGTCCCTTTTGTCAGCTGGGTGTCCTGGAACCAAGGGAATGAGGCCAGGGCTGGTACAGAGGGGCTGAGTTGCAGCAGAAGACCCAGCCCCTGAGCTGCAGCACAGAGTGGAGGTAGTGGGGAGCTGTCACCTGGGTATGCC AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJ7JJJJJJJJJJJJFJJJJAJJJJJJJJJJFJJJJJJJJJJJJJJJJFJJJJJJJJJJJJAJFFJJJJJJFJJJAFJJJJJJJJ

BAM

K00236:115:HGLKJBBXX:3:1206:11434:31470 0 chr22 16053092 8 151M * 0 0 CATGTGTGCAATTAGCGTGATGAGCTCTGACATGGCCTTGCATGGACGGATTGGGCAGGACACCCCAGCTGAGGAGGATGGCAGGAGTGATGGCACAGGGGAAAGGGTGGCATACCCAGGTGACAGCTCCCCACTACCTCCACTCTGTGCTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:151 AS:i:151 XS:i:146 XA:Z:chr14,-19789767,151M,1;

K00236:115:HGLKJBBXX:3:1206:11434:31470 16 chr22 16053200 8 151M * 0 0 GGCATACCCAGGTGACAGCTCCCCACTACCTCCACTCTGTGCTGCAGCTCAGGGGCTGGGTCTTCTGCTGCAACTCAGCCCCTCTGTACCAGCCCTGGCCTCATTCCCTTGGTTCCAGGACACCCAGCTGACAAAAGGGACTTGCCTGTACFFAJFFJJF

MergedBAM

K00236:115:HGLKJBBXX:3:1206:11434:31470 272 chr22 16053200 8 151M * 0 0 AGCACAGAGTGGAGGTAGTGGGGAGCTGTCACCTGGGTATGCCACCCTTTCCCCTGTGCCATCACTCCTGCCATCCTCCTCAGCTGGGGTGTCCTGCCCAATCCGTCCATGCAAGGCCATGTCAGAGCTCATCACGCTAATTGCACACATGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA MD:Z:0G3T1C0C0C0A2T1A0C2C0T0C0C0C0C0A2A0C1T2A0C0T0C1G3T0G1A0G0C1C0A0G0G0G0G0C2G0G0T0C0T0T2G4A1C0T1A0G1C1C0T3T0A0C0C0A0G2C0T0G0G1C0T0C0A0T0T0C2T2G0T0T0C0C0A0G0G0A0C0A1C0C4G1C0A1A0A0G0G0G1C1T0G0C1T0G0T0A0C0 PG:Z:bwa RG:Z:A NM:i:99 UQ:i:4029 AS:i:151

I have no idea what happened.

Could anyone help me to fix the problem?

Thanks.

From shlee on 2018-06-22

Hi @Weichi,

Please make sure you’ve loaded the correct reference for your data. IGV allows for loading data regardless of matching reference. As for your question about read metadata changing post-MergeBamAlignment, this is to be expected.

From Weichi on 2018-06-25

Hi @shlee ,

Thanks for your reply.

I fixed the problem by re-generated the fastq file.

I made a small fastq file from bam file to test the pipeline.And I found that this small fastq contain R1 and R2 (in one file). I thought it might be the source of error.

So I re-generated the fastq file by random sampling reads from original fastq file then made two small fastq file (fastq_R1 and fastq_R2).

I used this two fastq to run the pipeline (Fastq->uBAM->…->Merged BAM) successfully.

And this Merged BAM looks correct in IGV!

Thanks for your help !

Weichi

From shlee on 2018-06-25

Glad everything is working for you @Weichi. Thanks for sharing your solution.

From pschnepp on 2018-07-06

Hi,

Has anyone see this error when running MergeBamAlignment

```

Exception merging bam alignment – attempting to sort aligned reads and try again: Inappropriate call if not paired read

```

I’m using pair end data from the Drop-seq pipeline. I’ve sorted my aligned and unmapped bam files using Picard as suggested but keep getting this error. Any help would be great.

pschnepp

From shlee on 2018-07-06

@pschnepp, please do not post duplicate questions. I have followed up on your question in your original post at https://gatkforums.broadinstitute.org/gatk/discussion/12259/mergebamalignment-problem#latest.

From macko.tipor on 2018-08-07

Hi,

I use bamclipper to softclip primer sequences post-alignment before Mutect2 variant calling. You mentioned in a previous post in this thread that HaplotypeCaller throws away CIGAR strings, so all soft-clipping information is lost before variant calling. What about Mutect2 in this regard? Does it behave in the same way as HaplotypeCaller or exclude soft-clipped sequences from variant calling?

So does it make any sense to do this bamclipper step?

Thanks in advance!

From Sheila on 2018-08-12

@“macko.tipor”

Hi,

Mutect2 does the same reassembly step as HaplotypeCaller. All CIGAR strings are thrown away before the reassembly and it does not exclude soft clips in variant calling unless you request the tool to.

-Sheila

From macko.tipor on 2018-08-13

Hi Sheila,

Thanks for the info, it is clear now.

I evaluate an amplicon sequencing run. However, it is a bit unusual design since each amplicon has only one target-specific primer. The other primer binds to an adapter sequence. I have almost 1 billion reads and more than 13,000 primers. It seems unrealistic to try to trim primer sequences in fastq files. It would take a very long time given our 128 GB RAM server. In addition, I prefer to avoid generating fastqs. Therefore I would like to remove primer-binding sites by soft-clipping post-alignment. Bamclipper, it seems to me, can be used for it, and all the errors and warnings generated during such a step can be eliminated by Picard. However, since bamclipper expects two target-specific primers for each amplicon, I succeeded in soft-clipping primer sequences only at the 5' end of the reads by its usage. The read-through to primers at the 3' end of the mates can't be marked in this way. Since all 5' primer sequences are in read1 and all 3' primer readthroughs are in read2, a tool that always keeps read1 information and softclips read2 in overlapping regions seems to be suitable to tackle this issue. To clarify this point, I would like to test the effect of Picard's MergeBamAlignment CLIPOVERLAPPINGREADS option to soft-clip 3' primer readthroughs before the bamclipper step.

I use Picard version 2.18.11. In my workflow, MergeBamAlignment is the last part of a pipe as described in your Best Practices Pipeline. I applied CLIPOVERLAPPINGREADS=true and SORT_ORDER=queryname settings for the test run. However, when I examined the output files, I could see two problems.

1/ The output BAM files were sorted by coordinate. I checked the sort order in the header, too. 2/ Overlapping reads remained unclipped.

A mate pair from the output BAM is below. They are obviously overlapping but neither one is clipped in that region.

AH3LCYBGX3:1:12212:8851:10865 163 1 1857237 60 128M = 1857280 190 CACAGACTCTCCGGCAATCACAAAGCCCATGTTGAGGACGCTGCCTTCAATGGAGCACGTGATCATGGACGCCACGCCAGTGCCCATGAGGGTGAGGGTGAGCGTGCCTCTCTTGGAGATGATGTCCA 6/A6A/EEEEEE/AE/EEA/AE/EE<A/E/<EEAEE//6E/E//EE6E/</</E<EE///</EAAE//E//EAAA//<AEEEEEEEA/AE/6EE<<A6A<EE/A////A/A<EEEE/AE######### MC:Z:147M4S MD:Z:22T93T11 PG:Z:bwa RG:Z:AH3LC.1 NM:i:2 MQ:i:60 UQ:i:28 AS:i:118 QX:Z:/AAA/AE/EEAE RX:Z:GAGAGAGGGAGC AH3LCYBGX3:1:12212:8851:10865 83 1 1857280 60 147M4S = 1857237 -190 CCTTCATTGGAGCACGTGATCATGGACGCCACGCCAGTGCCCATGAGGGTGAGGGTGAGCGTGCCTCTCTTGGTGATGATGTCCAGTGTTTCTTGAGCCTAAAGAGACCACATCGCTCAGCCAGGAACCGATGTCTGCAGGAGCCACAGGT A/EAAE/AA//<AA66//EEEA<6/EE</AEAEA/6//A/6EEE<AAEA/AE<<E/EEAEE/6E//AAEEEAEAE/EE/EEE/E//EE6EAEEE6EEE/EEEA/EEEEEEEEEEAEEEEEEEEAE/E/<EAEEEEEEEEEAEEAA6AAAAAMC:Z:128M MD:Z:6A131T8 PG:Z:bwa RG:Z:AH3LC.1 NM:i:2 MQ:i:60 UQ:i:50 AS:i:137 QX:Z:/AAA/AE/EEAE RX:Z:GAGAGAGGGAGC

What can be the reason for these problems? Could you help to fix them? Thank you very much!

From yfarjoun on 2018-08-15

Hi, While you are correct that the reads are overlapping (in that there are regions of the genome that are covered by both reads) the length of the insert is longer than the reads, so there are no overhanging bases (tangling beyond the start of the mate) that would need to be clipped in this case.

For applicalbe reads, I would search for those that have insert length (column 9) of size 150 or less.

I’m confused about your claim that using SORT_ORDER=queryname emitting coordinate-sorted reads. As I see it, if SORT_ORDER isn’t “coordinate” the code write the output in the same order as the input….so that might be the problem….what is your input sorted as?

From macko.tipor on 2018-08-16

Hi,

Thank you very much for your remarks.

I see now that this option soft-clips 3’ overhangs in overlapping mate pairs, not the part of one mate of a pair that overlaps the other to remove duplicate information coming from a fragment as I initially thought.

I examined several read pairs derived from inserts shorter than 150 bp. There are soft-clips in them, but those clips seem to be composed of adapter-clippings and aligner-issued clippings due to alignment failure.

However, I could create a test object to examine the effect of this option. I soft-clipped 5’ primer sequences in reads1, then removed all clippings. The 3’ ends of reads2 reading through to the primer-binding site now overhang the 5’ end of their mate. I reverted the BAM to uBAM, then applied the [SamToFastq] | [BWA-MEM] | [MergeBamAlignment] pipe with settings CLIP_OVERLAPPING_READS=true and SORT_ORDER=queryname. Then I searched for inserts longer then 151 – min(primer length) but shorter than 151 bp. I could see that 3’ read-throughs to primer-binding sites were really soft-clipped.

In the above experiment, the uBAM that was the input of the pipe was queryname-sorted. I could see from the documentation of MergeBamAlignment that the default sort order of the ouput file is ‘coordinate’. Since I wanted to test the effect of queryname sorting on MarkDuplicates, I set SORT_ORDER to queryname. But in spite of the settings, the output merged BAM was coordinate-sorted.

It is not a serious problem, because I could sort the merged BAM in a SortSam step. It’s just strange. But the reason of this behavior may be in my command. The exact command was the following:

find “$DIR_INPUT_BAM” -maxdepth 1 -name *.”$SUFFIX_INPUT_BAM” | “$PARALLEL” —plus ‘\ INPUT_BAM={} SAMPLE_NAME=”$(‘basename’ “$INPUT_BAM” .’$SUFFIX_INPUT_BAM’)“ OUTPUT_BAM=’$DIR_OUTPUT_BAM’/”$SAMPLE_NAME”.’$SUFFIX_OUTPUT_BAM‘ # An interleaved fastq file gets to stdout. ‘java -Xmx4g -jar $PICARD’ SamToFastq \ INPUT=”$INPUT_BAM” \ FASTQ=/dev/stdout \ CLIPPING_ATTRIBUTE=’”$CLIPPING_ATTRIBUTE”’ \ CLIPPING_ACTION=’”$CLIPPING_ACTION”’ \ INTERLEAVE=true \ TMP_DIR=’”$DIR_TMP”’ | \ ‘$BWA’ mem \ -t 8 \ -M \ -p \ -L ‘$CLIPPING_PENALTY’ \ ‘”$Refseq”’ \ /dev/stdin | \ ‘java -Xmx4g -jar $PICARD’ MergeBamAlignment \ UNMAPPED_BAM=”$INPUT_BAM” \ ALIGNED_BAM=/dev/stdin \ OUTPUT=”$OUTPUT_BAM” \ REFERENCE_SEQUENCE=’”$Refseq”’ \ CREATE_INDEX=false \ ADD_MATE_CIGAR=true \ ALIGNED_READS_ONLY=true \ MAX_INSERTIONS_OR_DELETIONS=-1 \ ATTRIBUTES_TO_RETAIN=XS \ PRIMARY_ALIGNMENT_STRATEGY=’”$PRIMARY_ALIGNMENT_STRATEGY”’ \ CLIP_ADAPTERS=false \ CLIP_OVERLAPPING_READS=true \ INCLUDE_SECONDARY_ALIGNMENTS=’”$INCLUDE_SECONDARY_ALIGNMENTS”’ \ SORT_ORDER=’”$SORT_ORDER”’ \ TMP_DIR=’”$DIR_TMP”’‘

I use the pipe inside GNU Parallel, and the above command is part of a shell function whose argument for variable SORT_ORDER was “queryname” in this case.

I checked the sort order of the input and output files of MergeBamAlignment as shown below:

samtools view -H 172N_S7.reverted.bam | grep ‘@HD’ @HD VN:1.5 SO:queryname samtools view -H 172N_S7.reverted.merged.bam | grep ‘@HD’ @HD VN:1.5 SO:coordinate

From yfarjoun on 2018-08-16

Thanks for verifying.

It seems that there is a bug in MBA whereby it doesn’t fix the sort order when the output sort order is different from “coordinate”. I’ll put in an issue to fix this. thanks!

From yfarjoun on 2018-08-16

I’m still confused. can you post the log-file you got for the SORT_ORDER=queryname invocation?

From macko.tipor on 2018-08-16

I didn’t create any log-file intentionally. If it is created automatically, where can it be found?

Anyway, I can invoke the pipe again and save a log-file. Do you mean by it the info shown in the terminal while the command is running? If not, what shall I do to save one?

One piece of the info shown in the terminal is the following:

INFO 2018-08-16 15:56:10 SortingCollection Creating merging iterator from 32 files INFO 2018-08-16 15:58:53 AbstractAlignmentMerger Written in coordinate order to output 10,000,000 records. Elapsed time: 00:02:42s. Time for last 10,000,000: 161s. Last read position: 14:81,609,540 INFO 2018-08-16 16:00:24 AbstractAlignmentMerger Wrote 15475840 alignment records and 0 unmapped reads. [Thu Aug 16 16:00:24 CEST 2018] picard.sam.MergeBamAlignment done. Elapsed time: 26.48 minutes.

Shall I post the whole standard-out?

From yfarjoun on 2018-08-16

I’d like to see the first few lines of that output please.

From macko.tipor on 2018-08-16

In the experiment running now:

The sort order of the input file:

vaszko@seq:~/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled$ samtools view -H 559T_S6.downsampled.bam | grep ‘@HD’ @HD VN:1.5 GO:none SO:queryname

The start and end of the info shown in the terminal:

15:33:55.641 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/home/vaszko/Softwares/picard-2.18.11/picard.jar!/com/intel/gkl/native/libgkl_compression.so [Thu Aug 16 15:33:55 CEST 2018] SamToFastq INPUT=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled/559T_S6.downsampled.bam FASTQ=/dev/stdout INTERLEAVE=true CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 TMP_DIR=[/home/vaszko/tempstore] OUTPUT_PER_RG=false COMPRESS_OUTPUTS_PER_RG=false RG_TAG=PU RE_REVERSE=true INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Thu Aug 16 15:33:55 CEST 2018] Executing as vaszko@seq on Linux 4.15.0-30-generic amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.11-SNAPSHOT 15:33:55.950 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/home/vaszko /Softwares/picard-2.18.11/picard.jar!/com/intel/gkl/native/libgkl_compression.so [Thu Aug 16 15:33:56 CEST 2018] MergeBamAlignment UNMAPPED_BAM=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled/559T_S6.downsampled.bam ALIGNED_BAM=[/dev/stdin] OUTPUT=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/BAM/Merged/559T_S6.merged.bam ALIGNED_READS_ONLY=true MAX_INSERTIONS_OR_DELETIONS=-1 ADD_MATE_CIGAR=true CREATE_INDEX=false REFERENCE_SEQUENCE=/home/vaszko/Refseqs/b37_2013-12-08/human_g1k_v37.fasta ADD_PG_TAG_TO_READS=true PAIRED_RUN=true CLIP_ADAPTERS=true IS_BISULFITE_SEQUENCE=false ATTRIBUTES_TO_REVERSE=[OQ, U2] ATTRIBUTES_TO_REVERSE_COMPLEMENT=[E2, SQ] READ1_TRIM=0 READ2_TRIM=0 ALIGNER_PROPER_PAIR_FLAGS=false SORT_ORDER=coordinate PRIMARY_ALIGNMENT_STRATEGY=BestMapq CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true UNMAP_CONTAMINANT_READS=false MIN_UNCLIPPED_BASES=32 MATCHING_DICTIONARY_TAGS=[M5, LN] UNMAPPED_READ_STRATEGY=DO_NOT_CHANGE VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Thu Aug 16 15:33:56 CEST 2018] Executing as vaszko@seq on Linux 4.15.0-30-generic amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.11-SNAPSHOT INFO 2018-08-16 15:33:56 SamAlignmentMerger Processing SAM file(s): [/dev/stdin] . . . INFO 2018-08-16 16:02:39 SortingCollection Creating merging iterator from 56 files INFO 2018-08-16 16:04:45 AbstractAlignmentMerger Written in coordinate order to output 10,000,000 records. Elapsed time: 00:02:05s. Time for last 10,000,000: 125s. Last read position: 7:6,776,787 INFO 2018-08-16 16:06:50 AbstractAlignmentMerger Written in coordinate order to output 20,000,000 records. Elapsed time: 00:04:10s. Time for last 10,000,000: 124s. Last read position: 17:29,663,821 INFO 2018-08-16 16:08:24 AbstractAlignmentMerger Wrote 27042371 alignment records and 0 unmapped reads. [Thu Aug 16 16:08:24 CEST 2018] picard.sam.MergeBamAlignment done. Elapsed time: 34.47minutes. Runtime.totalMemory()=3066560512

The sort order of the output file:

vaszko@seq:~/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/BAM/Merged$ samtools view -H 559T_S6.merged.bam | grep ‘@HD‘ @HD VN:1.5 SO:coordinate

As I reviewed the command line, I noticed further strange things:

The arguments for PRIMARY_ALIGNMENT_STRATEGY and INCLUDE_SECONDARY_ALIGNMENTS were “MostDistant” and “false”, respectively. The same kind of problem as with SORT_ORDER.

From yfarjoun on 2018-08-16

from your logs: “SORT_ORDER=coordinate”

From macko.tipor on 2018-08-17

Definitely. But the argument used for SORT_ORDER was “queryname”.

So do you think the source of the problem is in my function? I don’t yet see where it is. Do you have any idea? Anyway, I will test it in several different forms again.

Thanks again for your help!

From yfarjoun on 2018-08-17

could you “echo” your commandline before you execute it? that way you should be able to see the exact value the parameters have….I suspect that it isn’t “coordinate”

From macko.tipor on 2018-08-17

I echoed it and found that all the arguments were seen right inside the function.

Then I rewrote the whole function character by character.

Finally I found the reason. I have recently put some notes in the working version. One of them was in wrong position and “shadowed” all the options in the following lines. However, there was no error or exeption since the invisible options were either of not required status or had some default value. That is why the default values came into effect instead of the actual arguments.

So it works well now!

Sorry for the trouble and thank you very much for your help once again!

From jejacobs23 on 2018-10-05

Thanks for a great tutorial.

I have a question regarding the final comments on multiplexed samples. I’m not sure I understand why it is necessary to run MarkDuplicates twice (once prior to merging the cleaned .bam files from different lanes and once after merging). From my reading of the GATK Tool Documentation, MarkDuplicates is capable of detecting both optical and PCR duplicates. Would it be OK to merge all the cleaned .bam files first and then run MarkDuplicates once?

Thanks.

From shlee on 2018-10-09

Hi @jejacobs23,

You’ll find an explanation to your question in the discussion thread following [Tutorial#6747](https://gatkforums.broadinstitute.org/gatk/discussion/6747). Basically, the twice-marking is of interest to sequencing centers who care about keeping track of the quality of samples and runs and delivering on promised complexity and coverage, as the Broad Genomics Platforms does. The rest of the research community probably will only want to mark duplicates once, at sample aggregation.

From deena_b on 2019-11-21

Is the cleaning up of primer sequences no longer necessary with gatk4?

Does it happen under the hood? If so, where?

How do these small differences arise?

Report abuse