Best Practices for Variant Discovery in RNAseq

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2014-04-16

This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In future we will provide cohort analysis recommendations, but these are not yet available.

The workflow is divided in three main sections that are meant to be performed sequentially:

- Pre-processing: from raw RNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)

- Variant discovery: from reads (BAM files) to variants (VCF files)

- Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller, which are highlighted in the figure below.

Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split’N‘Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR); performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

Split’N‘Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample’s data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.

Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with fiiltering information that can then be used in downstream analyses.

Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project’s requirements.

Important note on GATK versions

The [Best Practices](http://www.broadinstitute.org/gatk/guide/best-practices) have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section) to ensure that you are using the appropriate recommendations for your version.

Attachments:

ASHG-Ami-RNAseq_poster.png

BP_workflow_RNA_3.2.png

Updated on 2015-05-16

From JCGrenier on 2014-08-18

Hi Geraldine,

I have a quick question about the pipeline itself. Where does the Split’N‘trim has an effect in reality? Will this affect the recalibration step or only the snp/indel calling (Haplotype Caller) part?

Thanks a lot!!

From Geraldine_VdAuwera on 2014-08-18

Hi @JCGrenier‌,

That processing is meant to help primarily in the variant calling part (HC), but it may also lead to some minor improvements in BQSR.

From JCGrenier on 2014-08-18

Thanks a lot for your quick reply!

From mxqian on 2014-10-10

GATK Best Practices for RNA-seq suggests using the STAR 2-pass alignment steps to do the mapping. It sounds very good. But in operation, maybe there are some issues. As shown in the methods, for every sample, the genome index should be generated twice, which is time-consuming for big genome such as human. Moreover, it seems that we can not do alignment for two samples at same time with the same genome path, right? Should we copy a genome.fa for every sample in an individual place? Seems crazy for a big project.

Any suggestion?

From Geraldine_VdAuwera on 2014-10-10

Hi @mxqian,

The developer of STAR will be better able to answer your questions about his program. He has a support mailing list here: [https://groups.google.com/forum/#!forum/rna-star](https://groups.google.com/forum/#!forum/rna-star).

From mxqian on 2014-10-13

yes, I will try to find solution there. But here, I just want to know how you did it with the best practice actually? Just test samples one by one? seems impossible.

From Geraldine_VdAuwera on 2014-10-14

@mxqian‌

In our work (which has been limited to a fairly small number of samples) we have done it one by one, yes. However I know that there are ways to speed up the process if you are dealing with many samples. Alex Dobin (the developer of STAR) will be able to tell you what is the best way to do this.

From mxqian on 2014-10-14

OK, Got it. Thank you very much. I have post my question to the rna-star forum.

From jonathan.barlev on 2015-08-06

1. Could you explain why is mark duplicates used? Don’t you expect to see many duplicate reads arising from genuine copies of the same RNA transcript? Won’t this outweigh PCR duplication? In the RealignerTargetCreator step about 34% of my reads fail the “duplicate read” filter.

2. Regarding RealignerTargetCreator, should splice junction information be incorporated in some way? Won’t splices be misinterpreted as deletions?

3. Regarding IndelRealigner, I have 4 samples – should I input all 4 at this step (will it merge them into a single large output .bam?), or should indel realignment be run seperately on each sample. In the IndelRealigner step, I provided all 4 together.

Thanks,

From ami on 2015-08-13

hi,

1) since this pipeline is mainly for variant calling in RNA-seq, we found that the duplications add more errors than advantages. However, for many RNA-seq applications, it is better to keep the duplication. If there is a real variant in the RNA, you should be able to call it even after deduping. If you have a high rate to duplication, it might be good to try to run the pipeline with and without that step and compare the calls.

2) In our recommended pipeline the Indel Realignment step runs after the splitN’Trim step, which removes the splices, and thus there is no problems with them.

3) I would run all the steps for each sample separately.

As far as I know there is no advantage in merging the bams. @Geraldine_VdAuwera can confirmed.

From Sheila on 2015-08-17

@jonathan.barlev

Hi,

Sorry for the late response.

1) We recommend running Mark Duplicates and using it for every stage of your analysis. However, for differential expression analysis, we do not recommend using Mark Duplicates. So, when you use ASEReadCounter, you can simply disable the duplicate read filter using -drf DuplicateRead

2) As long as you run SplitN’Trim, you don’t have to worry about splice junctions.

3) You can run IndelRealigner on all of your bam files, but it will produce 1 bam file as output. If you would like 1 output bam for each input bam, you can use -nWayOut. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php#—nWayOut

-Sheila

From pjl_happy on 2015-10-16

Hi,

Since the pipeline now doesn’t provide cohort analysis for RNA-seq, I am wondering what should I do if I have hunderds of RNA-seq samples and want to call the SNPs among them. Should I first call SNPs for single samples, and after all the samples are done, then combine variants from different files into one? But I have noticed that it’s not suggested (http://gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one). So what’s your suggestion if you have hunderds of RNA-seq samples?

From Geraldine_VdAuwera on 2015-10-17

We have only validated single-sample discovery in RNAseq, but in theory there is no technical obstacle that I can think of that would prevent you from using the GVCF workflow.

From imamovic on 2016-01-06

Hi Geraldine,

I’m looking into streamlining the pre-processing steps into an automated pipeline. What would be the best way to do that? Has anyone done it already? Any suggestions?

Thanks,

Alma

From Geraldine_VdAuwera on 2016-01-06

That’s a big question, @imamovic. I see you’re at the Broad — are you aware that there are some internal services that offer this capability? Any particular reason you’re looking to roll your own?

From imamovic on 2016-01-07

I’m actually at DFCI primarily and sometimes working with bam files not produced at the Broad. Which internal services are you referring to in particular?

From Geraldine_VdAuwera on 2016-01-07

Ah, I see. I’m referring to Firehose and its new cloud-based incarnation, [FireCloud](http://gatkforums.broadinstitute.org/firecloud). The latter is still in active development but already available for alpha testing.

Otherwise you have two main options that we recommend/support: one is to use Queue to write a pipeline script that you can run on the cluster, the other is to write a script in WDL (the new worfkflow language developed for FireCloud and related projects) which you can run on a variety of platforms. For WDL we’re currently doing a big push to produce documentation and example scripts.

From imamovic on 2016-01-11

Which workflow if Firehose would you recommend for pre-processing RNA-Seq bams?

From Geraldine_VdAuwera on 2016-01-11

To be honest I don’t actually know that there is a pre-baked workflow to do RNAseq in Firehose — it’s run by a different team and we don’t have any visibility into it. But I would recommend asking whoever is doing support for Firehose because I’d be surprised if there wasn’t a workflow for this already.

From l0o0 on 2016-03-23

@Geraldine_VdAuwera

In the practice, data are aligned by STAT. Is it possible to use Tophat to align instead of STAR?

I have some bam files generated by Tophat, if I want to call snp from these bam files according this practice, is there any step I should pay attention to?

From Sheila on 2016-03-23

@l0o0

Hi,

Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/6215/using-tophat-accepted-hits-bam-file-for-rnaseq-variant-calling-with-gatk).

-Sheila

From khoogar on 2017-04-06

Hi,

I have single cell RNA-Seq data. Could you guide me please how I can use this pipeline for my data?

From Sheila on 2017-04-11

@khoogar

Hi,

I am not sure I understand what you are asking? Please give us more information about what you are trying to accomplish.

Thanks,

Sheila

Report abuse