Local Realignment around Indels

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by delangel

on 2012-07-23

For a discussion of the implications of removing indel realignment from workflows, see Blog#7847 from June 2016.

Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.

Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.

Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R /path/to/reference.fasta \ -o /path/to/output.intervals \ -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \ -jar /path/to/GenomeAnalysisTK.jar \ -I <lane-level.bam> \ -R <ref.fasta> \ -T IndelRealigner \ -targetIntervals <intervalListFromStep1Above.intervals> \ -o <realignedBam.bam> \ -known /path/to/indel_calls.vcf --consensusDeterminationModel KNOWNS_ONLY \ -LOD 0.4

Updated on 2016-06-21

From medhat on 2016-05-25

IT gives me this error : ##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@301da01d appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 65. Please see https://www.broadinstitute.org/gatk/guide?id=6470 for more details and options related to this error. Any IDea?? "this link does not take me to any place"

the command is : java -Djava.io.tmpdir=./temp/ -jar GenomeAnalysisTK.jar -T RealignerTargetCreator scores -R genome.fa -known maizesnpsv3.vcf -I inputmergedlanesorteddedup.bam -o inputmergedlanesorteddedup_intervals

Using GATK v3.5

From Sheila on 2016-05-25

@medhat

Hi,

It looks like that link should be: https://www.broadinstitute.org/gatk/guide/article?id=6470

-Sheila

From medhat on 2016-05-26

thanks a lot it was fixed by adding flag —fix_misencoded_quality_score

From Sheila on 2016-05-26

@medhat

Great. I just made a note to fix the error message so it points to the correct article :smile:

From daxue on 2016-06-01

The link to Step-by-step tutorial(s), (howto) Perform local realignment around indels, is broken: https://www.broadinstitute.org/gatk/guide/article?id=2800 . Some other howto links also not working. Could you please fix it? Thanks.

From daxue on 2016-06-01

The link is in Best Practice -> Pre-processing -> Realign Indels.

From Sheila on 2016-06-01

@daxue

Hi,

I suspect those are being fixed as we talk on the thread! :smiley:

As for the how to, [here](https://www.broadinstitute.org/gatk/guide/article?id=7156) it is. This new article replaces the old article (2800).

-Sheila

From Geraldine_VdAuwera on 2016-06-01

Yes this is actually fixed in web dev, and we’ll be pushing the updates later this week following the 3.6 release (which is happening today).

From CardiffBioinf on 2016-10-10

Dear GATK team

I was hoping you could advise how to resolve the 54bp deletion in the attached screen shot. This data is from a tumour-only amplicon experiment. It actually does get called with HaplotypeCaller but I understand it’s not really suitable to use the local reassembly tool with low complexity data, is that correct? So I’ve been trying to force IR to resolve it by manipulating the LOD, greedy and max consensuses values but the gap wont open. I have checked the region is included in the interval file.

Any thoughts or suggestions much appreciated.

Thanks

Matt

Screen Shot 2016-10-10 at 14.38.02.png

From Sheila on 2016-10-13

@CardiffBioinf

Hi Matt,

For tumor data, you should be using [MuTect2](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php).

-Sheila

From mariel on 2016-10-28

Hello,

I have run indel realignment on whole genome data. After indel realignment I got an high frequency of deletions at the end of the reads (up to 0.16). These high deletion frequencies at the end of the reads did not appear before running indel realignment. Is this normal? I have included plots of indel frequencies (after indel realignment and before indel realignment).

Thanks,

Marie

indel_frequencies.png

Unknown.png

From Sheila on 2016-11-01

@mariel

Hi Marie,

Indeed, we expect a higher number of indels at the ends of reads. Have a look at [the tool documentation](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php) and the Indel Realignment presentation [here](https://drive.google.com/drive/folders/0BwTg3aXzGxEDVE10Z01yWHVtYlU).

-Sheila

From Sheila on 2016-11-01

@mariel

Hi again Marie,

To clarify, Indel Realignment enables you to identify indels that you would normally miss, especially at the ends of reads. This is because mappers tend to make mistakes at the ends of reads. So we do expect to see more indels at ends of reads after running Indel Realignment. However, we don’t expect to see overall more indels covered only by ends of reads. If that is what you are seeing, this may actually be a sign of artifacts.

Can you tell us more about the sequencing technology you used? Is it exome data? How much coverage do you have? In low coverage exome data, you could have more sites that are only covered by ends of reads, which skews the expectations. Have you done the variant calling and gotten the TiTv ratio?

Thanks,

Sheila

From mariel on 2016-11-02

@Sheila

Hi Sheila,

Thanks for your answer. Yes I see overall more indels covered only by ends of reads.

I am using PE150bp sequencing on an Illumina HiSeq XTen. It is whole genome re-sequencing. I am mapping against the reference genome of another (related) species. I have a mean coverage of 10x.

I haven’t done the variant calling, I’m using ANGSD for SNP calling and genotype likelihoods estimation as I have relatively low coverage data.

Thanks,

Marie

From Geraldine_VdAuwera on 2016-11-08

That sounds like it could be an artifact, perhaps caused (or aggravated) by mapping against a different species. It’s hard to say without a deeper analysis but I would definitely consider this a red flag. You’ll need to evaluate your variant calling results carefully. Unfortunately since it is a use case that we have no experience with we can’t help you beyond this.

From mariel on 2016-11-09

@Geraldine_VdAuwera

Thanks for your answer. I will look into the read alignment data.

Marie

Report abuse