RNAseq short variant discovery SNPs Indels

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2018-01-09

Purpose

Identify short variants (SNPs and Indels) in RNAseq data.

Reference Implementations

| Pipeline | Summary | Notes | Github | Terra | |:-------------|:---------------|:---------|:-----------|:--------------| | RNAseq short variant per-sample calling | BAM to VCF | universal (expected) | yes | TBD |

Expected input

This workflow is designed to operate on a set of samples (uBAM files) one-at-a-time; joint calling RNAseq is not supported.

Main Steps

Mapping to the Reference

Tools involved: STAR

We begin with mapping RNA reads to a reference, we recommend using STAR aligner because it increased sensitivity compared to TopHat (especially for INDELS). We use STAR’s two-pass mode to get better alignments around novel splice junctions.

Data Cleanup

Tools involved: MergeBamAlignment, MarkDuplicates

We use MergeBamAlignment and MarkDuplicates (similarly to our DNA pre-processing best practices pipeline)

SplitNCigarReads

Tools involved: SplitNCigarReads

Because RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This step splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default this step also reassigns mapping qualities for good alignments to match DNA conventions.

Base Quality Recalibration

Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)

This step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.

Variant Calling

Tools involved: HaplotypeCaller

HaplotypeCaller doesn’t need any specific changes to run with RNA once the bam has been run through SplitNCigarReads. We do adjust the minimum phred-scaled confidence threshold for calling variants to 20, but this value will depend on your specific use case.

Variant Filtering

Tools involved: VariantFiltration

We recommend specific hard filters, since VQSR and CNNScoreVariants require truth data for training that we don’t yet have for RNA.

Updated on 2019-07-11

From maxflorian on 2018-01-23

Hi GATK development team,

I am currently using GATK for variant calling on RNAseq data. I think my analysis would really benefit from joint variant calling. So I have two questions:

Do you know when joint calling will be officialy implemented for RNAseq?
Meantime, is there a way to perform joint calling anyway (incremental analysis is not neccesary since I have less than 20 runs and enough computational resources)?

Thank you for your help!

From sbourgeois on 2018-03-06

Hi @Geraldine_VdAuwera ,

first of all, congrats to the whole team, this is great work, and the ability to all use a comparable pipeline should prove very valuable.

Would you have any estimate as to when the RNAseq pipeline would be available on Firecloud?

Also, the summary indicates BAM to VCF, and not uBAM, should we understand that mapping isn’t part of the pipeline? In this case, as the aligner seems to be one of the largest source of discrepancies between pipelines, wouldn’t that somewhat defeat the purpose? (I’m particularly thinking of comparison with exomes in GnomAD)

Cheers,

Steph

From Geraldine_VdAuwera on 2018-03-07

Thanks @sbourgeois :)

We should be able to get the pipeline into FC fairly quickly if that would be helpful to you.

Due to the idiosyncrasies of the project this particular WDL was developed for internally, it has a built-in RevertSam step that allows it to take any aligned bam and reprocess it from scratch, including alignment with STAR. Ultimately I would like to take that out to standardize on uBam as start point to the pipeline. We already have a separate WDL for the reversion process for those who need it.

But that would add on some time for the script modifications; we could already make the existing version available in FC if you think you could work with that.

From sbourgeois on 2018-03-09

Hi @Geraldine_VdAuwera ,

there is no emergency, I think it’s better to wait until the WDL is modified in order to start from uBAMs; it would be quite wasteful to align the reads just to then get the resulting files through RevertSam.

Thanks a lot :smile:

From clareau on 2018-07-16

Hi GATK development team,

I was planning on doing genotyping of RNA-seq data for several samples this week. I noticed that the GATK4 RNA-seq workflow is still “in development”. Would you all recommend still genotyping RNA-seq data with GATK3 to ensure accuracy at this time? Or is GATK4 fine to use (I can write my own WDL methods if needed)?

Thanks!

-Caleb

From Sheila on 2018-07-24

@clareau

Hi Caleb,

I am assuming you are asking about the workflow [here](https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels/blob/master/rna-germline-variant-calling.wdl). In that workflow, HaplotypeCaller and SplitNCigarReads still use GATK3. You should be fine substituting GATK4 in those. The team has ported those tools in GATK4 but have not validated them in the workflow yet. It would be great if you could try it out and let us know how it works.

-Sheila

From Nemanja on 2019-01-23

Hello GATK development team,

I was wondering if using the latest version of STAR aligner instead of the one that is in WDL would constitute a significant divergence from GATK Best Practices for RNAseq short variant discovery workflow?

From seahearman on 2019-01-23

Hi GATK support team,

I am trying to follow the Best practices, but it always throws the error that “Error retrieving content: Could not parse content retrieved from forum database.”

would you please check that?

Thanks

From TomWillDo on 2019-07-15

I notice there’s some changes here from the best practises I used before this page was released (of course, just after I submitted my honours thesis!)

Some questions:

MergeBamAlignment: where do I get the unaligned BAM files to be merged with the aligned ones? Is this an additional command I need to add to STAR?

ApplyBQSR: I used GATK3 previously, is there this options for the older tools? Are there any other issues I should be aware of if I stick to GATK3?

Thanks!

From deena_b on 2019-12-09

Is there a reason that this tutorial isn’t listed on the GATK4 tutorial page below?

https://software.broadinstitute.org/gatk/documentation/topic?name=tutorials

Report abuse