created by shlee
on 2016-03-08
This tutorial replaces Tutorial#2800 and applies to data types within the scope of the GATK Best Practices variant discovery workflow.
We provide example data and example commands for performing local realignment around small insertions and deletions (indels) against a reference. The resulting BAM reduces false positive SNPs and represents indels parsimoniously. First we use RealignerTargetCreator to identify and create a target intervals list (step 1). Then we perform local realignment for the target intervals using IndelRealigner (step 2).
Jump to a section
Why do indel realignment?
Local realignment around indels allows us to correct mapping errors made by genome aligners and make read alignments more consistent in regions that contain indels.
Genome aligners can only consider each read independently, and the scoring strategies they use to align reads relative to the reference limit their ability to align reads well in the presence of indels. Depending on the variant event and its relative location within a read, the aligner may favor alignments with mismatches or soft-clips instead of opening a gap in either the read or the reference sequence. In addition, the aligner's scoring scheme may use arbitrary tie-breaking, leading to different, non-parsimonious representations of the event in different reads.
In contrast, local realignment considers all reads spanning a given position. This makes it possible to achieve a high-scoring consensus that supports the presence of an indel event. It also produces a more parsimonious representation of the data in the region .
This two-step indel realignment process first identifies such regions where alignments may potentially be improved, then realigns the reads in these regions using a consensus model that takes all reads in the alignment context together.
Tools involved
Prerequisites
Download example data
human_g1k_v37_decoy.fasta.gz
, .fasta.fai.gz
, and .dict.gz
. This same reference is available to load in IGV.For simplicity, we use a single known indels VCF, included in the tutorial data. For recommended resources, see Article#1247.
In the command, RealignerTargetCreator takes a coordinate-sorted and indexed BAM and a VCF of known indels and creates a target intervals file.
java -jar GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R human_g1k_v37_decoy.fasta \ -L 10:96000000-97000000 \ -known INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf \ -I 7156_snippet.bam \ -o 7156_realignertargetcreator.intervals
In the resulting file, 7156_realignertargetcreator.intervals
, intervals represent sites of extant and potential indels. If sites are proximal, the tool represents them as a larger interval spanning the sites.
Comments on specific parameters
-I
.-known
. The known indels VCF contains indel records only.-known
, (ii) one or more alignment BAMs each passed in via -I
or (iii) both. We recommend the last mode, and we use it in the example command. We use these same input files again in the realignment step.-L 10:96000000-97000000
in the command to limit processing time. Otherwise, the tool traverses the entire reference genome and intervals outside these coordinates may be added given our example 7156_snippet.bam
contains a small number of alignments outside this region.The target intervals file
The first ten rows of 7156_realignertargetcreator.intervals
are as follows. The file is a text-based one-column list with one interval per row in 1-based coordinates. Header and column label are absent. For an interval derived from a known indel, the start position refers to the corresponding known variant. For example, for the first interval, we can zgrep -w 96000399 INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf
for details on the 22bp deletion annotated at position 96000399.
10:96000399-96000421 10:96002035-96002036 10:96002573-96002577 10:96003556-96003558 10:96004176-96004177 10:96005264-96005304 10:96006455-96006461 10:96006871-96006872 10:96007627-96007628 10:96008204
To view intervals on IGV, convert the list to 0-based BED format using the following AWK command. The command saves a new text-based file with .bed
extension where chromosome, start and end are tab-separated, and the start position is one less than that in the intervals list.
awk -F '[:-]' 'BEGIN { OFS = "\t" } { if( $3 == "") { print $1, $2-1, $2 } else { print $1, $2-1, $3}}' 7156_realignertargetcreator.intervals > 7156_realignertargetcreator.bed
In the following command, IndelRealigner takes a coordinate-sorted and indexed BAM and a target intervals file generated by RealignerTargetCreator. IndelRealigner then performs local realignment on reads coincident with the target intervals using consenses from indels present in the original alignment.
java -Xmx8G -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar \ -T IndelRealigner \ -R human_g1k_v37_decoy.fasta \ -targetIntervals 7156_realignertargetcreator.intervals \ -known INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf \ -I 7156_snippet.bam \ -o 7156_snippet_indelrealigner.bam
The resulting coordinate-sorted and indexed BAM contains the same records as the original BAM but with changes to realigned records and their mates. Our tutorial's two IGV screenshots show realigned reads in two different loci. For simplicity, the screenshots show the subset of reads that realigned. For screenshots of full alignments for the same loci, see here and here.
Comments on specific parameters
-targetIntervals
file from RealignerTargetCreator, with extension .intervals
or .list
, is required. See section 1 for a description.-I
. IndelRealigner operates on all reads simultaneously in files you provide it jointly.-known
.-nWayOut
instead of -o
.USE_READS
consensus model. This is the consensus model we recommend because it balances accuracy and performance. To specify a different model, use the -model
argument. The KNOWNS_ONLY
consensus model constructs alternative alignments from the reference sequence by incorporating any known indels at the site, the USE_READS
model from indels in reads spanning the site and the USE_SW
model additionally from Smith-Waterman alignment of reads that do not perfectly match the reference sequence.KNOWNS_ONLY
model can be sufficient for preparing data for base quality score recalibration. It can maximize performance at the expense of some accuracy. This is the case only given the known indels file represents common variants for your data. If you specify -model KNOWNS_ONLY
but forget to provide a VCF, the command runs but the tool does not realign any reads.-Xmx8G
. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version
, and look for MaxHeapSize
. If this does not help, and you are jointly processing data, then try running indel realignment iteratively on smaller subsets of data before processing them jointly.-maxReads
parameter, then the tool skips the region.Changes to alignment records
For our example data,194 alignment records realign for ~89 sites. These records now have the OC
tag to mark the original CIGAR string. We can use the OC
tag to pull out realigned reads and instructions for this are in section 4. The following screenshot shows an example pair of records before and after indel realignment. We note seven changes with asterisks, blue for before and red for after, for both the realigned read and for its mate.
Changes to the example realigned record:
72M20I55M4S
, reflects the realignment containing a 20bp insertion.Changes to the realigned read's mate record:
RealignerTargetCreator documentation has a -maxInterval
cutoff to drop intervals from the list if they are too large. This is because increases in number of reads per interval quadratically increase the compute required to realign a region, and larger intervals tend to include more reads. By the same reasoning, increasing read depth, e.g. with additional alignment files, increases required compute.
Our tutorial's INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf
contains 1168 indel-only records. The following are metrics on intervals created using the three available options.
#intervals avg length basepair coverage VCF only 1161 3.33 3,864 BAM only 487 15.22 7,412 VCF+BAM 1151 23.07 26,558
You can project the genomic coverage of the intervals as a function of the interval density (number of intervals per basepair) derived from varying the known indel density (number of indel records in the VCF). This in turn allows you to anticipate compute for indel realignment. The density of indel sites increases the interval length following a power law (y=ax^b). The constant (a) and the power (b) are different for intervals created with VCF only and with VCF+BAM. For our example data, these average interval lengths are well within the length of a read and minimally vary the reads per interval and thus the memory needed for indel realignment.
Realign Indels
icon for best practice recommendations and links including to a 14-minute video overview.samtools view 7088_snippet_indelrealigner.bam | grep 'OC' | cut -f1 | sort > 7088_OC.txt
to create a list of readnames. Then, follow direction in blogpost SAM flags down a boat on how to create a valid BAM using FilterSamReads.-nt
threads, while IndelRealigner shows diminishing returns for increases in scatter-gather threads provided by Queue. See blog How long does it take to run the GATK Best Practices? for a breakdown of the impact of threading and CPU utilization for Best Practice Workflow tools.-selectType INDEL
option. Note this excludes indels that are part of mixed variant sites (see FAQ). Current solutions to including indels from mixed sites involves the use of JEXL expressions, as discussed here. Current solutions to selecting variants based on population allelic frequency (AF), as we may desire to limit our known indels to those that are more common than rare for more efficient processing, are discussed in two forum posts (1,2).Updated on 2017-10-02
From medhat on 2016-05-30
in the log file I found
INFO 11:59:35,228 IndelRealigner – Not attempting realignment in interval 10:121027618-121028073 because there are too many reads.
any suggestion?
From Sheila on 2016-05-30
@medhat
Hi,
You can try increasing [-maxReadsForRealignment](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php#—maxReadsForRealignment) and [-maxReadsInMemory](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php#—maxReadsInMemory).
-Sheila
From andylane on 2017-09-21
It appears that the Google drive link to the tutorial file is not working for now. I found an identically named file on the FTP server:
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/tutorial_7156.tar.gz
From Sheila on 2017-09-26
@andylane
Hi,
For some reason I cannot edit the document right now, but the correct link is ftp://ftp.broadinstitute.org/bundle/b37/
Thanks for letting us know.
-Sheila
From shlee on 2017-10-02
@Sheila, it appears Chrome is having security certificate issues again. I’m able to update this link in Safari.
Thanks @andylane for this.