Calling variants in RNAseq

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2014-03-06

Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19 mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa\ --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass mkdir $runDir cd $runDir STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass mkdir $runDir cd $runDir STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants 20, but you can lower this to increase sensitivity if needed.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow

Updated on 2017-01-21

From NicolasRobine on 2014-06-02

@mmterpstra: I believe the GATK pipeline should only used for calling variants and not for anything else. Therefore I think you should process your reads in parallel and apply CollectRnaSeqMetrics to the resulting BAM. I believe that things like MarkDuplicate will modify the results significantly…

From ami on 2014-07-28

Hi @sirian‌

“I actually don’t quite understand how splitNcigarReads can remove “intronic overhang” if it is before the N cigar. How does it know it is intronic, without any annotation information“

You are right, the tool does not try to “guess” the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

btw – In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

Hope it makes more sense now, let me know if not.

From sirian on 2014-07-29

> @ami said:

> Hi sirian‌

> “I actually don’t quite understand how splitNcigarReads can remove “intronic overhang” if it is before the N cigar. How does it know it is intronic, without any annotation information“

> You are right, the tool does not try to “guess” the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

> btw – In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

> Hope it makes more sense now, let me know if not.

Thanks!

Did you mean using reference annotations in STAR alignment? SplitNcigarReads doesn’t seem to have such an option.

I’m considering removing any variants that fall outside of exons based on RefSeq annotation, because theoretically RNA-seq reads should be from exons only (if not consider immature mRNA).

From ami on 2014-07-29

@sirin – yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea).

The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

From sirian on 2014-07-29

> @ami said:

> sirin – yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea).

> The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

I actually tried redoing the alignment with Ensembl annotation, but that variant still exists. That position is in one exon by Ensembl, but not by RefSeq. It will be more stringent to use RefSeq but I’m worried it may be too stringent.

Anyway, glad to know you will improve GATK for this issue and I’m looking forward to it.

Thanks.

From humaasif on 2014-07-30

@ami thank you so much

From s6juncheng on 2014-08-08

Hi,

I’m wondering does GATK support or plan to support calling variance betwenn two bam files, for example, compare one RNAseq data with pared DNAseq data to identify RNAediting events? Ideally would also support biological replicates.

Thanks for the nice tool again.

From ami on 2014-08-08

@s6juncheng‌ We do plan to provide such option/tool. We already doing something like that with allele specific expression (i’m currently working on that) and collaborate with few group to do the same with RNA editing. When we will have tools that are ready to be used by the community we will be glad to share them, as we always do.

From jdcamp on 2014-09-15

Hello GATK team. I was trying to run SplitNCigarReads on some RNA-seq data aligned using the PRADA pipeline (BWA against genome+transcriptome reference, then all reads converted back genome coordinates). I tried using the "--refactorNDNcigar_string" option from the recent "vnightly-2014-09-12-gb0687a8" build (due to the presence of "1N1D1N" like reads), but received this error before the walker began transversal of any reads:

##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Here is the stack trace:

ERROR stack trace

java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.initialize(SplitNCigarReads.java:150) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)Here is the command that I used:

java -Xmx8g -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R Homosapiensassembly19.fasta -U ALLOWNCIGARREADS --refactorNDNcigarstring -I PCGA-01-0002-050-23843-0279R-SLCC279.dup.marked.bam -o PCGA-01-0002-050-23843-0279R-SLCC279.Nsplit.bam

Any thoughts? Thanks!

From Patrick_Deelen on 2014-09-30

Thank you for this clear tutorial. In this paper (http://www.sciencedirect.com/science/article/pii/S0002929713003832) several filters are used to remove among others known RNA-editing sites and variant calls at splice sites. Apparently this helps to lower the false positive rate. This leads me to the following questions:

1. Is the haplotype caller also sensitive to these potential false positive calls?

2. If the haplotype caller is sensitive to for instance RNA editing is it possible to exclude these variants upfront so they are not used in the haplotype assembly?

From Geraldine_VdAuwera on 2014-09-30

Hi @Patrick_Deelen‌

Let me bring @ami‌, our main RNAseq developer, into the conversation. He’s currently focusing on RNA editing events so he will be able to tell you more about HC’s behavior in that regard.

From ami on 2014-10-01

Hi @Patrick_Deelen‌ ,

We do not consider the known RNA editing sites as FP so we do not filter them out. In fact one implementation of our pipeline is to find the RNA editing events. One can filter out (using -XL for example or other ways that @Geraldine_VdAuwera‌ can suggest) the known RNA-sites if you have them as interval file or VCF file. However I can’t think of a way to prevent the haplotype assembly to use those site in the current pipeline.

For the variant calls at splice sites, we believe that the previous studies that use older tools had many FP in those location due to the limitation of the aligners and tools. Currently we are not filtering those type of calls since the HC and the aligning pipeline are able to do much better job in those regions and for some implementation it is important to be able to call such sites. Also in this case there is no easy way to ignore these sites in the haplotype assembly step, however if it turn out to improve the output of the pipeline, we might include such option in the future version of the tools. If you see cases of FP calls that are due to this reason, please share them with us so we can make our pipeline even better

Also, can you tell me a little bit about the purpose of RNAseq variant calling in your project? I try to learn more about the user cases of our pipeline so we can have them in mind while developing new tools and recommendations.

Let me know if you have more questions or comments

From shangzhong0619 on 2014-10-01

Hello @ami,

Thanks for the fascinating pipeline. I have some questions about the pipeline.

1. In DNAseq pipeline, there is a step “joint genotyping” which merge all the gvcf files from all samples. In RNAseq it seems we should run each sample separately? And it also comes to a next question:

2. How do we deal with the lanes and samples in RNAseq. Same with DNAseq? ( get recalibrated files separately —> merge lanes of the same sample —> realign). Or should we run through each pair of fastq files separately?

3. And for the organism without dbSNP, in the step of base quality recalibration, should we do the same with DNAseq (hard filter high quality results as dbSNP —-> run BQSR).

Thank you very much.

From Geraldine_VdAuwera on 2014-10-01

Hi @shangzhong0619,

1. At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.

2. Lanes and samples should be treated the same way as for DNAseq.

3. The same recommendation applies, yes.

From Patrick_Deelen on 2014-10-03

Hi @Ami,

Thank you for your prompt and extensive answer. Let me first start by briefly explaining our project. We recently performed eQTL and ASE using only genotypes derived from public RNA-seq samples (http://biorxiv.org/lookup/doi/10.1101/007633). We are here only interested in the genomic variation since these contribute to eQTL and ASE effects. Since we started that project the number of available samples has doubled and in the near future we want to also include these samples. In contrast to our current work we would then also like to also discover novel variants instead of only calling known variants. For this reason I also asked if the GVCF mode will be made available for the RNA-seq as well, Geraldine told me that work on this is in progress.

Great that you now have higher accuracy around the splice sites, I agree that these are biological very interesting to call. I’m still a bit worried about the RNA editing sites though. In essence you are dealing with a non-diploid region at the RNA level, the 2 real copies from the DNA plus 1 or 2 haplotypes where the editing event took place. As long as this does not adversely affect the genotyping for nearby sites I’m happy, then I can simple exclude the editing sites at a later time. It would be a pity though if the nearby accuracy drops due to an editing event, is this something that you assessed?

We would by the way be happy to beta test the GVCF method in combination with RNA-seq genotyping when the software is ready for this. We are using the Geuvadis RNA-seq data as a benchmarking set since these samples are also part of the 1000G project.

From Geraldine_VdAuwera on 2014-10-03

@ami may respond in more detail, but FYI it is already technically possible to emit GVCFs for RNAseq data, if you’re interested in testing this. You just combine the specific arguments of both modes. We’d love to hear whether this produces meaningful results for you and what are the chief error modes you observe. By the way, if you like living on the cutting edge, you can try the in-development versions which are made available every night; see nightly builds in the downloads section.

From dnaase on 2014-10-15

I get the similar error as @egeulgen. few warning line for each sample “Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD”. I just copy and paste the command line in this tutorial, only change the file name to my own VCF file. The bam file is aligned by MapSplice2, instead of output quality score > 20, i output all of the variant regions with qual > 0

From Geraldine_VdAuwera on 2014-10-15

@dnaase, I don’t understand what you are saying. Can you please clarify exactly what command you ran and what was the complete error message?

From GooderPanQi on 2014-10-29

@ami, HaplotypeCaller supported for multiple samples calling on RNA-Seq?

From ami on 2014-10-29

@GooderPanQi‌ , we never tried that ourself, but there is no reason why it should not work. You should probably follow the DNA pipeline, using gvcfs, and using the specific parameters for RNA. We was just discussing it with our friends in Mount Sanai and they will also try to do it.

Just to be clear – we do not have experience with that, so that the reason that it is not in the recommendation, but we will be happy to get any feedback from you and other users that are willing to try that, and hopefully, based on that, we can update the recommendation at some point.

Please let us know how it goes.

From GooderPanQi on 2014-10-30

@ami, we run the step “variant calling”, but there’re something wrong: “Invalid command line: The parameter recoverDanglingHeads is deprecated. This argument is deprecated since version 3.3”. GATK version 3.3 don’t supported the parameter “recoverDanglingHeads”??

From Geraldine_VdAuwera on 2014-10-30

Hi @GooderPanQi‌

That functionality is enabled by default now; this is described in the [3.3 version notes](https://www.broadinstitute.org/gatk/guide/version-history). I’ll update the RNAseq docs accordingly.

From corlagon on 2014-11-12

Hi ami and Geraldine,

First of all, thanks for this excellent workflow! Easy to follow and very descriptive!

I do have, however, some remarks concerning read groups and mapping quality. You wrote in the workflow, that these need to be adjusted, after running star. Are you aware that this can be done already within star? With the parameter “—outSAMmapqUnique” you can change the mapping quality assigned for unique mappings from the default 255 to 60. With the “—outSAMattrRGline” parameter you can add a read group line to the sam header. I further think it’s useful to mention that star can directly output mappings in a coordinate sorted BAM file using “—outSAMtype BAM SortedByCoordinate”. If you prefer picard sorting, you could at least start with an unsorted BAM using “—outSAMtype BAM Unsorted”.

Just to let you know.

Best,

From Geraldine_VdAuwera on 2014-11-12

@corlagon Thanks! Are these capabilities in the latest version of STAR? We know Alex has been making a lot of upgrades, and we haven’t yet updated this doc to fit, so I wouldn’t be surprised if that was the reason for these redundancies.

From ami on 2014-11-12

@corlagon‌ – thanks, we do aware of all those changes since we do touch bases with Alex. I assume that in the next versions of the pipeline, some of those changes or all of them will be included (in fact Alex included some of those options after our joint discussions).

Currently, since we focus on using the existing pipeline for several applications, and we plan to revisit the pipeline (and improve it farther more) after we will get more experience with several of its main use cases.

Please continue to give us feedback and comments, to allow us make sure we provide the best recommendations.

From GooderPanQi on 2014-11-25

Hi ami and Geraldine, thanks for this excellent workflow. Since we used ‘Star’ to align RNA raw read, we find some problems that the number of aligned bam ( 19291348, ‘samtools flagstat’ was used ) with ‘STAR’ is more than RNA sequencing raw reads ( 18698820, wc -l fq ). Then, how we should compute mapped rate on RNAseq with ‘STAR’ ??

From Geraldine_VdAuwera on 2014-11-25

@GooderPanQi The developer of STAR, Alex Dobin, would be better qualified to answer that question. I believe he has a mailing list for support.

From santayana on 2014-12-08

Hi there,

New member here. I was wondering whether you would consider adding your “dangling-head” functionality to the Unified Genotyper as well. I’ve decided to use UG for RNAseq variant calling because my samples are heterogeneous primary tumours which likely contain a variable and unpredictable number of haplotypes in a given region, and so I didn’t think HC was appropriate. I realize GATK does not support variant calling for tumour samples, but other tools (eg. MuTect) do not support RNAseq, so I tried to make the best of it.

It turns out that this RNAseq pipeline performs quite well for my samples, in that I get strong concordance between calls made from RNAseq and whole exome for the same individual (~90%), given adequate coverage of the variant. From manual inspection of my alignments, it seems that many of my false positive calls from RNAseq are caused by these “dangling heads” misaligned to introns. Something like “—recoverDanglingHeads” in UG would clean a lot of this up, I think. Just a suggestion.

Cheers!

From ami on 2014-12-08

santayana‌ can you upload an IGV screenshot of such an example, I don't think you are thinking about "dangling heads" in the same way that we are, but I want to be sure. [I will let

Geraldine_VdAuwera‌ to replay about any future development of UG]

From santayana on 2014-12-08

![](https://drive.google.com/file/d/0B4VAqfRxlxGpYU0wX2xEdEQ0d1U/view?pli=1)

Thanks for the quick reply Ami. The top window is whole exome and the bottom window is RNAseq from the same individual. It seems that the variant is only present in reads that only extend a few bases into the intron, suggesting to me that this a misalignment.

From santayana on 2014-12-08

[https://drive.google.com/file/d/0B4VAqfRxlxGpYU0wX2xEdEQ0d1U/view?pli=1](https://drive.google.com/file/d/0B4VAqfRxlxGpYU0wX2xEdEQ0d1U/view?pli=1)

From Geraldine_VdAuwera on 2014-12-08

Hi @santayana‌,

Unfortunately it is impossible to add this functionality to the UnifiedGenotyper, since it does not perform any reassembly or realignment of the reads. In any case, the UnifiedGenotyper is now permanently deprecated, which means that we will no longer make any fixes or improvements to it. Going forward, we are only putting development effort into HaplotypeCaller.

For the record, the HaplotypeCaller is neither more nor less appropriate for tumor samples than UnifiedGenotyper, but at least it does handle RNAseq properly, so I would recommend you use HC rather than UG. In future we may be able to provide support for tumor RNAseq, but that’s still a long way down the road.

From santayana on 2014-12-08

Hi Geraldine,

Thank you for your reply. It may be a moot point now, but please let me follow up your comment about HC vs UG for tumour samples, since this is a point of confusion for my colleagues and I. UG seems to me to be more suitable than HC for genetically heterogeneous tumour samples for the following reason: Say a tumour has two subclones: one contains a heterozygous mutation A, and one contains a heterozygous mutation B. These mutation modify adjacent bases, but they are mutually exclusive: no cell (and no haplotype) contains both A and B mutations. As I understand it, HC calls the two likeliest haplotypes, which in this case would be 1) wild type for both A and B, and 2) mutant for one of A or B (whichever is more abundant). In other words, it is bound to miss one of the mutations, because it cannot handle more than two haplotypes. UG on the other hand, considers each mutation independently, so, as long as the heterozygous mutant genotype has higher likelihood than wild type, then the mutation will be called, even if the mutant allele frequency deviates strongly away from 0.5. Hence, A and B are likely to be called by UG. In my tests, heterozygous mutants can be called with minor allele frequencies as low as 0.1, presumably because a heterozygous genotype is still more likely than a wild type genotype when sufficient depth is present.

Have I misunderstood something?

From Geraldine_VdAuwera on 2014-12-08

Hmm, I guess that’s a fair point — but you could solve this in HC by using the `-ploidy` argument to allow the caller to consider additional haplotypes.

From ami on 2014-12-09

santayana‌ 1) since it is an alignment issue, I just wanted to make sure you are using 2 pass STAR and use the SJ from the first round in the second round. Do you? 2) That type of problem are suppose to be fixed by SplitNCigarString and not by "dangling heads", which fix the missed true variants at the begging of exons. If you do use SplitNCigarString in you pipeline, it might still not fixed that specific error, since the default mismatches to clip the overhang are 2. You might change it to one and see how many "real" event you will remove, and how many FP you will remove, and decide based on that tradeoff if to use it or not. 3) I don't think you are write regarding your interpretation of how the HC will handle your suggested case. If there is no read to support the A+B haplotype, it won't be called as one event although it is a consecutive bases. Let me bering to the discussion

valentin who is actively developing the HC.

4) As Geraldine suggested the multi ploid option of HC can solve many of the more complex cases that might give advantage to a single locus caller, and we are testing it currently with RNA editing calling, and hopefully will publish even better recommendation for such cases in the near future.

Best,

Ami

From santayana on 2014-12-09

Hi Ami and Geraldine,

1) Yes, I am following the pipeline that you have set out here, with the only exception of substituting HC for UG.

2) Ah OK, I understand now this is SplitNCigarString issue. I’ll try changing the default option – thanks!

3) That’s my point, that one of A or B won’t be called in the sample even though they are both present.

4) I suppose I could experiment with the -ploidy option, although I imagine the intent of this option is to handle non-diploid organisms, and not the issue of clonal heterogeneity of tumours discussed here (although broad and focal copy number changes are common in cancer, which makes calculating the number of haplotypes even more complicated). The extent of tumour heterogeneity is an active area of research, but single cell sequencing data suggests it is very extensive. If I set -ploidy > 10, wouldn’t that slow things down a lot?

From Geraldine_VdAuwera on 2014-12-10

Higher ploidy means more complex calculations, which take more time, that’s true. But the results will be more accurate. It’s up to you to decide where to make the tradeoff.

Just keep in mind that we don’t consider UG to be an appropriate substitute for HC.

From santayana on 2014-12-11

OK, I will test this out. Thank you for your feedback!

From sirian on 2015-02-06

Could I use tophat instead of STAR for the read alignment? Does anything need to be changed in the pipeline if I use tophat?

From Sheila on 2015-02-09

@sirian

Hi,

Sure, you can use Tophat instead of STAR. However, we do not give any recommendations about any possible adaptations that might be needed when using a different aligner than STAR.

-Sheila

From sirian on 2015-02-26

I followed the pipeline on this page. There are some weird indels at splicing positions. Here is one below:

![](http://bx.psu.edu/~weisheng/private/Capture.PNG)

HC called an indel at the splicing juction:

`chr5 176887555 . TCCTGAGGGCACGAGGAAAAGGTTGGGGCTGGGCCAGGCAAGCTGAGATAGCCCTGTCTGCCTCACCTGGCTGGTGCCCCCAGCTCTCA T 957.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-0.604;ClippingRankSum=1.638;DP=285;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.063;QD=3.36;ReadPosRankSum=-2.440 GT:AD:DP:GQ:PL 1/1:2,252:254:99:995,360,0`

By blatting the sequence, the called indel allele here is just the intron (starting from the 3rd nt in the allele sequence). You can see nearly all reads got correctly spliced by mapping and clipped by SplitNCigarReads. I don’t understand how HC could call it as an indel. And nearly all the alignment in the exon region end with TC which is exactly the end of that exon. So even if the intron was mis-called as an indel, the allele should be TC instead of T. However when HC called it as an indel, it even ignored the last C.

Any idea? Thanks.

From Geraldine_VdAuwera on 2015-02-27

This depends on the aligner identifying the splice junction correctly. If it doesn’t, then the reads won’t be mapped with the N cigar elements that the split & trim tool needs to do its job. Did you use STAR or something else?

From sirian on 2015-02-27

> @Geraldine_VdAuwera said:

> This depends on the aligner identifying the splice junction correctly. If it doesn’t, then the reads won’t be mapped with the N cigar elements that the split & trim tool needs to do its job. Did you use STAR or something else?

Yes I used STAR. It must have identified the correct splice junction. The alignment plot I showed above is the input for HC. It’s after split & trim. You can see almost all reads get clipped at the correct junction. I believe mapping and splitting/trimming did the right job here. But HC made a wrong call. I don’t know how it can be.

From Geraldine_VdAuwera on 2015-03-02

The call metrics in your VCF record suggest that the HC sees a majority of reads spanning the exon, with a deletion. Maybe the reads are getting remapped unexpectedly. Try having HC generate the output bam (see tech doc for argument details) as that will show you what it thinks the region looks like after reassembly.

From ami on 2015-03-10

@sirian did you generated the HC bam output and checked it? I’m interesting to know if you find out what was going wrong on that site.

From amitm on 2015-04-08

hi GATK dev.,

I am using the GATK best practices for var-calling from RNA-seq and I was wondering if I could specify the minimum read depth in Haplotype Caller. I went through the documentation but couldn’t find anything similar.

RNA-seq data, by design, would have unequal depth across exons and I think a parameter that takes the expression value for the gene before deciding read-depth threshold would be very useful.

Is it possible to specify depth any how, gene-wise or one value, in HC?

From Sheila on 2015-04-08

@amitm

Hi,

There is no parameter for minimum read depth in Haplotype Caller. However, you can use -minReadsPerAlignmentStart. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—minReadsPerAlignmentStart This will set the minimum number of reads at the start of an active region.

You can also filter out variant calls with low DP after running Haplotype Caller.

-Sheila

From amitm on 2015-04-08

@Sheila

Hi,

I appended read depths to the variants called by HC and the min. for the variant allele is 1 (filtered depth of course). So I am little relieved that HC wouldn’t do a blind refusal to call at low depth.

What started my inquiry was failure of HC to call variants where the actual gene was expressed low, like 2 reads (after dedup.) with 1 WT and 1 mut. I know that its there because for the same sample I have evidence from whole exome.

So, I was wondering if the thresholds (whatever HC uses) could be tweaked for a list of pos. (those from whole-exome, say) to comprehensively identify somatic mutations that are expressed in RNA.

This would be very helpful in prioritizing antigenic somatic mutations with more potential (seen in RNA => more potential)

Thanks

From ami on 2015-04-08

Just a thought – Sheila do you think amitm can use GGA mode? we never tested it with RNA-seq, but in theory it should work and might be able to correctly call those positions. I also not sure if in this case the HC will ignore any minimum threshold it might have.

From Geraldine_VdAuwera on 2015-04-08

@ami Unfortunately GGA mode is buggy in HC. (we’ve had an improvement request to fix for a while but it’s just very low priority)

Very low read depth is always going to be challenging to deal with. What you could do to improve sensitivity to this sort of case would be to do a separate run in which you tell the tools to ignore duplication flags (which is the equivalent of not marking duplicates — there is an option to do this in the latest nightly build). Then you can evaluate and merge the results intelligently. You’d want to give priority to the calls made on the dedupped version, but include calls made only on the non-dedupped data at sites where DP is low in the dedupped data, in case they can rescue real variants that were overlooked due to DP in the dedupped data. Does that make sense?

From amitm on 2015-04-09

@Geraldine_VdAuwera

Thank you for the suggestion. I will try it. Though it would take a lot of hand-holding for the data. But this should be able to pull out the ones affected by DP. I have used this approach in targeted MiSeq data for low cov. regions.

I notice that there is an -L parameter in HC. Can it accept single base positions, rather than intervals, for restricting SNV calling in RNA-seq only to pos. identified in whole-exome (or any type of DNA seq from paired sample)?

thanks again

From Geraldine_VdAuwera on 2015-04-09

Yes, you can actually just pass a VCF of sites of interest to -L.

From amitm on 2015-04-09

Oh, thats very interesting! Many thanks. I will try this. Will surely streamline my steps.

From brohawndg on 2015-06-01

Hello,

Thanks so much for providing this pipeline. Most helpful!

I followed the directions as indicated, but when I go to run BQSR with the below command:

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R /Volumes/DroboStorage/HomosapiensUCSChg19/Homosapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa -I split.bam -knownSites /Volumes/DroboStorage/HomosapiensUCSChg19/Homosapiens/UCSC/hg19/Annotation/Everything\ Else/Variation/dbsnp138.hg19.vcf -o recaldata.table

I get an error message saying:

ERROR MESSAGE: The platform (platform) associated with read group GATKSAMReadGroupRecord @RG:id is not a recognized platform. Allowable options are ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS,UNKNOWN

I believe this to be directly tied to this directive detailed below that proceeded the 2-step STAR alignment:

java -jar AddOrReplaceReadGroups I=staroutput.sam O=rgadded_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

Do I need to convert the split.bam back to a sam, change platform to Illumina, convert back to bam and then run BQSR from there? Is there a flag I can use to tell BaseRecalibrator to be more lenient?

Thanks!

Dave Brohawn

From Geraldine_VdAuwera on 2015-06-01

@brohawndg You can run AddOrReplaceReadGroups directly on the bam file. You’ll need to replace the stand-in values in the command by actual values, e.g. you’d write `RGPL=ILLUMINA` if you are running on ILLUMINA data, instead of `RGPL=platform`. Sorry if that wasn’t clear in the doc.

From brohawndg on 2015-06-01

Haha d’oh that makes perfect sense. You wrote it to be compatible with anybody’s data…. Ok well thanks again!!

From ami on 2015-06-01

@brohawndg also, please note that the new version of STAR can also add RG while you run STAR, if you prefer to do it in that step.

From thileepan on 2015-08-06

Hi,

The organism on which I work doesnt have good genome so I have have to rely totally on de novo assembled Transcriptome. In that case can I use STAR aligner or can I make use of the sorted abm that has been generated using Bowtie2 and samtools ? Kindly guide me.

From Sheila on 2015-08-10

@thileepan

Hi,

We don’t have a definite answer for you. We recommend using STAR if you have a whole genome reference, but for a transcriptome reference, we do not have recommendations.

If you use the output of Bowtie2, just make sure the output matches the output format of STAR.

It may be best to test out both STAR and Bowtie2 to see which gives better results.

Good luck!

-Sheila

From streetcat on 2015-08-16

Hi,

First I’d like to apologise for not using the correct terminology here, as I’m an engineer, not a biologist.

I’m following the above pipeline, trying to find variance between RNAseq of two “groups” of plants.

For each group, three “samples” were sequenced, so I have six bam files: G1S1, G1S2, G1S3, G2S1, G2S2, G2S3

How do I find variance between the two groups?

Variance between the groups and the reference is not the goal.

Thanks,

Raviv.

From streetcat on 2015-08-17

And now with better terminology, thanks to my biologist wife.

We would like to compare the transcriptome sequence of two genotypes: wild-type and mutant.

Each genotype have three technical repeats.

We’ve already aligned the reads to a reference genome, following the above protocol, but uncertain about the variant calling step.

Thank you.

From Geraldine_VdAuwera on 2015-08-17

@streetcat Yay for the helpful biologist wife! (I am someone’s biologist wife myself ;))

That said your original description made enough sense already. You’ll want to call variants the same way as recommended for pretty much all experimental designs (see HaplotypeCaller GVCF workflow). Once you have joint calls for all samples together, it’ll be a matter of subsetting variants based on e.g. whether they are unique or shared in wild-type vs. mutant (depending on what you’re looking for).

From streetcat on 2015-08-18

@Geraldine_VdAuwera are you referring to this workflow: https://www.broadinstitute.org/gatk/guide/article?id=3893

From streetcat on 2015-08-19

@Geraldine_VdAuwera, Could you please verify that I’m setting all the right parameters here:

`java -jar $GATK -T HaplotypeCaller -R Pepper.v.1.5.total.chr.fasta -I sample1_split.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 —variant_index_type LINEAR —variant_index_parameter 128000 —emitRefConfidence GVCF -o sample1.g.vcf`

BTW, I’m running this on a mac book pro with 16GB of RAM, and it looks as if it will be taking couple of hours to run, and using only on CPU. does GATK support multiple cores?

Thanks!

From brohawndg on 2015-08-24

Hello,

Best way to cite this specific workflow?

Thanks!

Dave Brohawn

From Sheila on 2015-08-25

@brohawndg

Hi Dave Brohawn,

You can either cite the papers stated here: https://www.broadinstitute.org/gatk/about/citing

or cite the actual webpage.

There is a paper that may be in review about the workflow, but I have to find out the details. I will get back to you.

-Sheila

From Sheila on 2015-08-26

@streetcat

Hi raviv,

Sorry for the late response. You are looking at the correct workflow.

You command is fine. The only thing is that -stand_call_conf 20.0 -stand_emit_conf 20.0 will not be taken into account. The GVCF will output all potential variants regardless of quality score. You can set your thresholds in GenotypeGVCFs.

-Sheila

From streetcat on 2015-09-01

Hi @Sheila

Thanks for clarifying.

On a different topic, what is the recommended tool for adding annotations from a GFF3 file to the combined VCF file?

Thanks,

Raviv.

From daih on 2015-09-06

Hi,

Currently I’m working on calling variants from RNA-seq data. Following this workflow, I got VCF files for my sequencing data. But there are no ’0/0’ (wild genotype) in my VCF file. I wonder if this is correct. Is it because I did “calling variants” step for only one sample? Is there a way to see wild genotype alleles?

Thanks

Hongji

From Sheila on 2015-09-08

@daih

Hi Hongji,

You can use -out_mode to specify which types of sites to output to the VCF. The default is variant sites only. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—output_mode

-Sheila

From Sheila on 2015-09-08

@streetcat

Hi Raviv,

Sorry for the late response. I don’t think we support the GFF3 format. The best thing to do is convert to VCF.

-Sheila

From dan_m on 2015-09-20

Hi – I’m following this pipeline on some RNA samples, everything seems normal til I get to mapping against the generated splice junction reference with STAR. The final star alignments are 99% unmapped. Star log says: % of reads unmapped: too short | 99.11%

final star output SAM file is 300mb down from 30gb.

Anyone else experience this ?

From Geraldine_VdAuwera on 2015-09-20

Hi @dan_m, we can’t answer that for you, so I recommend you ask the author of STAR for help with this. He has an active mailing list: https://groups.google.com/forum/#!forum/rna-star

From dan_m on 2015-09-21

> @Geraldine_VdAuwera said:

> Hi dan_m, we can’t answer that for you, so I recommend you ask the author of STAR for help with this. He has an active mailing list: https://groups.google.com/forum/#!forum/rna-star

Ok, FYI it turned out to be the —sjdbOverhang 75 creating the splice junction reference.

For my data having insert size 101 I had to make it 201 (2 * insert size -1)

No more error. Thanks for writing up this tutorial :)

From Hasani on 2015-10-05

Hi,

could you please confirm what kind of non. ref allele HC reports?! meaning that the allele on RNA -level (expressed) is not the one on DNA-level. So the final reported result, would it be Ref allele, non.ref as observed in the data = expressed one?!

thank you!

From Sheila on 2015-10-05

@Hasani

Hi,

The allele is representative of any alleles at the site that are not the reference allele or alternate allele. Basically, it is any allele that is present at the site, but does not have enough evidence to call it as an alternate allele.

For example, let’s say you have 20 reads with good qualities covering a site, and the reference at the site is A. 10 reads have an A, 9 reads have a T, and 1 read has a G. The reference will be A, the alternate will be T, and NON-REF will represent the G.

I hope this helps.

-Sheila

From wrosan on 2016-01-19

Hi GATK team,

I am trying to run this pipeline to call variant in my RNAseq data. I got my sam files from STAR aligner. However, when I try to addorreplacegroups using PICARD I get the following error message:

Exception in thread “main” htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File checK_nobT.bam; Line 1

Then I thought of using samtools to convert from SAM to BAM and use PICARD addorreplacegroups option. Then I got the following error message:

Exception in thread “main” htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File checK_nobT.bam; Line 1

Please, help me solve this error?

From yasinkaymaz on 2016-01-19

As I read the error line, it says that quality string length is not equal to sequence length; length(QUAL) != length(SEQ). So, you may wanna have your sam/bam file checked/validated…

From wrosan on 2016-01-19

@yasinkaymaz

Looks like I copied the same error message for both of my commands. Actually I had different error message when I used Picard AddOrReplaceReadGroups options for the SAM files. The error message is:

Exception in thread “main” htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not Enough Fields; File 2Aligned.out.sam; Line 1

When I converted this SAM file to BAM using Samtools and used Picard AddOrReplaceReadGroups options for the BAM file, I got the following error message. I am confused how I got different error message after SAM to BAM conversion. Also, I noticed an extra ‘N’ inserted at the end of the sequence shown in the error message:

Exception in thread “main” htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File checK_nobT.bam; Line 1

From Geraldine_VdAuwera on 2016-01-19

@wrosan It sounds like your sam file is badly formatted. You can run ValidateSamFile in MODE=SUMMARY to evaluate how bad it is.

From wrosan on 2016-01-19

@Geraldine_VdAuwera I actually tried that too and got the same error message:

Caused by: htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File 2Aligned.out.sam; Line 396

Line: NB500921:39:HK5VYBGXX:1:11309:16590:5060 0 Supercontig_2.9 702412 255 151M1S * 0 0 CAGCAAGCTCGATCATCACCAAAACCCGCTCCAAAGACCACCACCACCACCCCAACCGTAAATTCTTCACCATGCACCTTGGACTCTACCTTGCTTGGGCTCTGGTTGTACTCCACTCTGTGGTAGCTCACCCGCAGTGCTACGAACCCTG at htsjdk.samtools.SAMLineParser.reportFatalErrorParsingLine(SAMLineParser.java:432) at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:217) at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519) at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:263) … 6 more

From Geraldine_VdAuwera on 2016-01-19

Then your file is unusable if it’s not even readable by that tool. Perhaps something went wrong during your alignment job. I suggest redoing ti from scratch — and make sure you use the most recent version of STAR.

From wrosan on 2016-01-20

@Geraldine_VdAuwera

I got a suggestion to use tail command for my aligned.sam file and found out that SAM files was corrupted. Now, I am trying to redo the alignment from the start. However, I was wondering if I could use the commands, ‘—outSAMattrRGline’ to add RG and -’-outSAMtype BAM SortedByCoordinate’ to get the final output as sorted BAM. Will this affect the downstream processes?

Also, does the files size of FASTQ affect the alignment? I have 23 samples and the fastq file size range from 12Gb to 30GB.

Thanks,

Roshan

From Geraldine_VdAuwera on 2016-01-20

@wrosan, STAR isn’t our software so my ability to comment on what it does is limited, but that sounds perfectly fine.

Can’t speak to the effect of file sizes — you’re better off asking the author of STAR. Alex has a google group mailing iirc, and is very responsive.

From guidoleoni on 2016-01-31

I don’t completely understand why you remove duplicated reads from rnaseq data. I know that for exome seq this is done to remove PCR amplification artifacts, In RNA seq the high number of duplicated reads is often considered as measure of RNA expression. So I guess that for snv this could reflect the number of clones with the mutation. Removing duplicates might affect the real frequency of the mutation. PLease could you give a feedback about this observation? namy thanks

From ami on 2016-01-31

For the specific pipeline of variant discovery we remove the duplication reads from the same reason that you mentioned – remove sequencing errors that might be duplicate in the PCR process and thus count as real variants. The assumption is that if there is a real variant in the RNA, you will have enough reads that are not exact duplicates that support that variant. I would not believe any variant that all its support comes from only duplicates reads. You are right, that if we are interested in the exact frequency of the alleles (and not just in the HET vs HOM state) we might want to keep the duplicated reads, but I would only do it in a second pass, after I discovered all the HET sites, and then call again on the specific sites of interest, as we and other do in Allele specific expression pipelines.

From Geraldine_VdAuwera on 2016-02-10

I believe that is indeed the case, @Longinotto

From y4dar on 2016-02-25

I have what is likely a simple (and potentially stupid) question. But I’d rather ask and look stupid than waste a several hours troubleshooting by myself.

In this step, java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf

which is the input.bam? The dedupped.bam is processed to give a split.bam. Am I right in assuming that the split.bam is the input for the Haplotpye caller step?

Thanks.

From Geraldine_VdAuwera on 2016-02-26

It is if you’re skipping indel realignment and base recalibration. The former is fine but I wouldn’t recommend the latter.

From pwaltman on 2016-03-04

In a previous comment, you wrote:

Geraldine_VdAuwera said: > Hi

shangzhong0619,

> 1. At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.

> 2. Lanes and samples should be treated the same way as for DNAseq.

…

Has the limitation to analyzing single samples been resolved yet?

Also, when analyzing a sample that has fastq’s from multiple lanes, you recommended that we treat them the same way that DNAseq samples would be treated? However, is it really necessary to re-run the split-‘n-trim step again AFTER we’ve merged the BAMs from the different lanes? I can understand why one would want to re-run the dedup, realignment and recalibration steps, but the split-n-trim step seems unnecessary, and is throwing errors for a few of my samples, i.e. it’s complaining about a read that has a start position that occurs after it’s stop position (“The stop position 90336265 is less than start 90336266 in contig 15”). Unfortunately, I don’t have access to all the intermediate bam files to determine the step that introduced that error, but can re-run them if necessary.

From pwaltman on 2016-03-04

According to an answer I found on the comments for the deprecated RNAseq pipeline page (http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow), it sounds like the Split’N‘Trim step only needs to be performed once, and the suggested was to apply it to the merged bam. Is this still true?

If not, what are the recommended steps for dealing with samples that have reads from multiple lanes?

If the suggestion is still true (to only apply it 1x on the merged bam), I’d suggest that you add a clarification as it currently appears that you recommend that users run the full pipeline on the lane-specific bams, as well as the merged bam – hence running the Split’N‘Trim step 2x.

From Geraldine_VdAuwera on 2016-03-05

This is a misunderstanding, apologies of our docs weren’t clear. We certainly don’t recommend running all the preprocessing steps twice. The only step that you would run twice of you have multiple lanes per sample is marking duplicates. We’ll clarify.

But no, we haven’t yet validated multi-sample analysis of RNAseq. There isn’t any obvious technical obstacle, we just haven’t tested it yet so we can’t guarantee results.

From pwaltman on 2016-03-05

> @Geraldine_VdAuwera said:

> This is a misunderstanding, apologies of our docs weren’t clear. We certainly don’t recommend running all the preprocessing steps twice. The only step that you would run twice of you have multiple lanes per sample is marking duplicates. We’ll clarify.

> But no, we haven’t yet validated multi-sample analysis of RNAseq. There isn’t any obvious technical obstacle, we just haven’t tested it yet so we can’t guarantee results.

Hmmm….. well, I just got an answer about when to run the Split’N‘Trim step. GATK v.3.5 will throw an error if you try to apply any of the subsequent analytical steps if you haven’t already run the Split’N‘Trim step first, i.e. the RealignerTargetCreator walker threw an error before I could run any of the subsequent steps.

With respect to the pre-processing steps, are you absolutely sure that we only need to re-run the de-dup step? According to article 3060 ([https://www.broadinstitute.org/gatk/guide/article?id=3060](https://www.broadinstitute.org/gatk/guide/article?id=3060)), it clearly states that we should re-run the de-dup, realignment & base recalibration steps:

> 3. Per-sample pre-processing

> At this point you perform an extra round of marking duplicates. This eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

> The example data becomes:

> sample1.merged.dedup.bam

> sample2.merged.dedup.bam

> Then you run indel realignment and base recalibration (BQSR).

> Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

> Base recalibration will be applied per-lane (as it should be) if you assigned appropriate read group information in your data.

From Geraldine_VdAuwera on 2016-03-08

@pwaltman The article you link to states the following:

> […] we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group, then we merge the data to produce a single BAM file for each sample (aggregation). Then we re-run Mark Duplicates, run Indel Realignment and run Base Recalibration on the aggregated per-sample BAM files.

This is further detailed in bullet point format that boils down to:

- run each [per-lane] file through mapping, sorting and marking duplicates (dedup)

- merge and run mark duplicates per-sample (actually this can be done in a single step by running MarkDups on all lane files at once for a sample)

- then you run indel realignment and base recalibration per-sample

It does not state that you should do realignment and BQSR in the per-lane processing. That used to be the case (a couple of years ago), but no longer.

If you’re running on RNAseq, you simply insert the Split’N‘Trim step after the second round of MarkDuplicates.

From pwaltman on 2016-03-08

> @Geraldine_VdAuwera said:

> …

> It does not state that you should do realignment and BQSR in the per-lane processing. That used to be the case (a couple of years ago), but no longer.

> If you’re running on RNAseq, you simply insert the Split’N‘Trim step after the second round of MarkDuplicates.

Ok, apologies for my misunderstanding! In that case, will it pose a problem if we/I ran the Split’N‘Trim, realignment & BQSR steps on each lane prior to merging? I realize that it wasn’t necessary, but will it cause issues with the downstream analysis, i.e. would the data somehow be over-normalized and/or over-pre-processed?

From Geraldine_VdAuwera on 2016-03-08

No, it shouldn’t hurt. The second SplitNTrim will do nothing, the second realignment will only “complete” the realignment process started per-lane, and BQSR should not make any significant changes if the recalibration worked well the first time around (you can check the recalibration plots and see that the blue and pink dots should be in mostly the same places).

From pwaltman on 2016-03-10

I’m getting an error similar to what danase reported in October 2014

Specifically, the call to the GATK that I used was the following: java -Djava.io.tmpdir=$TMPDIR/tmp \ -Xmx32g \ -jar /home/waltmape/bin/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R $TMPDIR/broad/Homo_sapiens_assembly19.fasta \ -V merged_CBN_raw.vcf \ -o merged_CBN_filtered.vcf \ -window 35 -cluster 3 -filterName FS -filter “FS > 30.0” -filterName QD -filter “QD < 2.0”

And the error that I got is the following (below is the complete output, with the errors highlighted): INFO 23:50:13,183 HelpFormatter – ———————————————————————————————————————— INFO 23:50:13,186 HelpFormatter – The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 INFO 23:50:13,186 HelpFormatter – Copyright © 2010 The Broad Institute INFO 23:50:13,186 HelpFormatter – For support and documentation go to http://www.broadinstitute.org/gatk INFO 23:50:13,189 HelpFormatter – Program Args: -T VariantFiltration -R /tmp/537624.1.standard.q/broad/Homo_sapiens_assembly19.fasta -V merged_CBN_raw.vcf -o merged_CBN_filtered.vcf -window 35 -cluster 3 -filterName FS -filter FS > 30.0 -filterName QD -filter QD < 2.0 INFO 23:50:13,192 HelpFormatter – Executing as waltmape@node107.panda on Linux 2.6.32-279.el6.x86_64 amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_51-b16. INFO 23:50:13,192 HelpFormatter – Date/Time: 2016/03/07 23:50:13 INFO 23:50:13,193 HelpFormatter – ———————————————————————————————————————— INFO 23:50:13,193 HelpFormatter – ———————————————————————————————————————— INFO 23:50:13,601 GenomeAnalysisEngine – Strictness is SILENT INFO 23:50:13,675 GenomeAnalysisEngine – Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 23:50:13,834 GenomeAnalysisEngine – Preparing for traversal INFO 23:50:13,839 GenomeAnalysisEngine – Done preparing for traversal INFO 23:50:13,839 ProgressMeter – [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 23:50:13,839 ProgressMeter – | processed | time | per 1M | | total | remaining INFO 23:50:13,840 ProgressMeter – Location | sites | elapsed | sites | completed | runtime | runtime **WARN 23:50:16,173 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD WARN 23:50:16,453 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD WARN 23:50:17,040 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD WARN 23:50:20,248 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD WARN 23:50:29,978 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD ** INFO 23:50:34,575 ProgressMeter – done 647958.0 20.0 s 32.0 s 100.0% 20.0 s 0.0 s INFO 23:50:34,575 ProgressMeter – Total runtime 20.74 secs, 0.35 min, 0.01 hours INFO 23:50:35,216 GATKRunReport – Uploaded run statistics report to AWS S3

Does that just mean that the QD information isn’t available for some variants?

From Sheila on 2016-03-10

@pwaltman

Hi,

You guessed correctly! Those WARN statements simply tell you the annotation value is missing, and you can ignore them :relaxed:

-Sheila

From npriedig on 2016-03-12

Hello,

I performed this pipeline and the results in the vcf generally match what I’m seeing in the STAR-aligned RNA-seq .bam file via IGV, which is fantastic. The STAR alignment produced a relatively high mapping rate (>80%).

However, I’m noticing a huge loss of reads after the HaplotypeCaller (please see below) in the majority of samples. Some even as high as 95-98% filtered out. Is this ‘normal’ behavior for GATK RNA-seq variant calling and how would you recommend I troubleshoot this if not? Notably, this is an analysis of a capture-based RNA-seq library (Illumina Access, ~20K genes).

Any insight you have would really help our efforts.

sincerely,

INFO 06:49:44,034 HaplotypeCaller – Ran local assembly on 62928 active regions

INFO 06:49:47,290 ProgressMeter – done 3.137161264E9 11.3 h 12.0 s 100.0% 11.3 h 0.0 s

INFO 06:49:47,291 ProgressMeter – Total runtime 40618.52 secs, 676.98 min, 11.28 hours

INFO 06:49:47,291 MicroScheduler – 202398991 reads were filtered out during the traversal out of approximately 260542074 total reads (77.68%)

INFO 06:49:47,292 MicroScheduler – -> 109173730 reads (41.90% of total) failing DuplicateReadFilter

INFO 06:49:47,292 MicroScheduler – -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 06:49:47,292 MicroScheduler – -> 93225261 reads (35.78% of total) failing HCMappingQualityFilter

INFO 06:49:47,293 MicroScheduler – -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 06:49:47,293 MicroScheduler – -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 06:49:47,293 MicroScheduler – -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 06:49:47,294 MicroScheduler – -> 0 reads (0.00% of total) failing UnmappedReadFilter

INFO 06:49:48,523 GATKRunReport – Uploaded run statistics report to AWS S3

From VinzCH on 2016-03-15

Yes

npriedig, I have the exact same issue (but with even worse numbers) I am wondering if this is due to a recent update of STAR or Picard that leads to more stringent rules for duplicate markings or Mapping Quality? I don't think that this error was reported before, and I did not see

ami, Geraldine or Sheila mentioning this problem with the pipeline.

Indeed, any insights or parameter tuning would be cool.

INFO 03:39:18,346 HaplotypeCaller – Ran local assembly on 9273438 active regions

INFO 03:39:18,427 ProgressMeter – done 2.730871774E9 4.8 h 6.0 s 100.0% 4.8 h 0.0 s

INFO 03:39:18,427 ProgressMeter – Total runtime 17277.69 secs, 287.96 min, 4.80 hours

INFO 03:39:18,430 MicroScheduler – 84656048 reads were filtered out during the traversal out of approximately 101644337 total reads (83.29%)

INFO 03:39:18,431 MicroScheduler – -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 03:39:18,431 MicroScheduler – -> 61870337 reads (60.87% of total) failing DuplicateReadFilter

INFO 03:39:18,432 MicroScheduler – -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 03:39:18,432 MicroScheduler – -> 22785711 reads (22.42% of total) failing HCMappingQualityFilter

INFO 03:39:18,432 MicroScheduler – -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 03:39:18,433 MicroScheduler – -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 03:39:18,433 MicroScheduler – -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 03:39:18,433 MicroScheduler – -> 0 reads (0.00% of total) failing UnmappedReadFilter

INFO 03:39:21,655 GATKRunReport – Uploaded run statistics report to AWS S3

From Geraldine_VdAuwera on 2016-03-16

Hi @npriedig, as shown in the filtering summary the main reasons you are losing reads are MAPQ and duplicates, so it’s a quality control problem. You mention a high mapping rate, but you should check what portion of your mapped reads are actually uniquely mapped, since those are the only reads that will be usable for variant discovery. Regarding duplicates, it’s either a problem of true duplication, or the prep method you use generates the appearance of duplication (as in amplicon-based sequencing). If the latter, you can disable the DuplicateRead filer (using -drf) but you should evaluate results carefully if you do.

From Geraldine_VdAuwera on 2016-04-01

Hi @npriedig, sorry for the very late response. I thought I had answered this and closed the issue ticket prematurely.

In a nutshell, you have two “problems” leading to filtering out of many reads: one is the high rate of duplication (41.90% of total as indicated in the filtering summary) and the other is the also high rate of reads with a MAPQ below 20 (35.78% of total) which HaplotypeCaller disregards as useless.

Regarding the duplication, one thing to keep in mind is that for RNAseq the calculation of duplication rates is confounded by the possibility that some apparent duplicates are actually the result of higher levels of transcription. We still prefer to filter out duplicates, but some people choose to disable the duplicate filter (using `-drf DuplicateRead`).

I would say the other problem, of low MAPQs, is more a concern. It indicates that even though it looks like your data is mapped, there are lots of reads that are not well mapped. If you followed the pre-processing procedure that we recommend, you will have converted all MAPQs of 255 (which are the unique mappers that we can use for variant discovery) to a flat 60. Those reads will automatically pass HC’s filter. So I would advise looking at the distribution of MAPQs in your files (both the original file mapped by STAR and the one obtained after pre-processing to check if the problem was there from the start or whether it arose on the way.

It’s also worth checking what is the actual depth (DP) at most of your variant sites. It’s possible that you don’t really have a problem, if you still have sufficient coverage despite all the filtering to make confident calls.

From kchang3 on 2016-04-14

I have followed the GATK best practice guideline for preparing the BAM file for mutation calling, and noticed that majority of the mutation fall in 3’UTR region, which isn’t something i notice in DNAseq. Is this a common in mutation calling in RNAseq data?

Top mutation types in my data

3’UTR 42.79%

Missense_Mutation 16.36%

Intron 12.62%

Silent 7.86%

From uki156 on 2016-04-15

Hi all, I’m new here, sorry if this has been asked before, but from what I can see it hasn’t been so far.

I tried running this workflow and I get an error at the GATK Indel Realigner tool. Basically what the error says is:

>Error caching SAM record xxxx, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.

Now, I could skip this step as Indel Realigner is optional, but I’m guessing that further GATK tools in the workflow will give the same error, as this seems to be a problem with my aligned BAM file coming out of the STAR aligner.

Basically, from what I could gather so far, my BAM file has multiple identical copies of a read, which I am not really sure what the problem is about. I thought the error could be from the Mapping Qualities which aren’t proper, so I tried changing the parameter —outSAMmapqUnique in STAR from 255 to 60 as advised for these type of tasks, but I still get the same error.

Also this only happens when I try this analysis with paired-end reads, single-end reads come through ok.

Any ideas?

From Sheila on 2016-04-17

@uki156

Hi,

What kind of data are you working with and how was it sequenced? Did you follow the Best Practices pre-processing steps?

Thanks,

Sheila

From npriedig on 2016-04-17

Hello @Geraldine_VdAuwera, thank you for getting back regarding my issue! What you wrote makes perfect sense and is very clear. I will look very carefully at the MAPQ values to see if I can improve the results.

From Geraldine_VdAuwera on 2016-04-22

@kchang3 Sorry for the late reply. This is not something I’m aware of, but we don’t deal with RNAseq much so I really can’t speak with any authority on the matter. My only thought is that this does smell a bit like an artifact, so maybe consider checking whether the distribution of annotation values and scores of variants that are falling in 3’UTR regions is in line with the rest or not. If there’s something wrong with those variants, that might come up as a distinct profile on one or more dimensions.

From JPE on 2016-07-20

What is the visualization software used in this tutorial? The one showing the read alignment?

From Sheila on 2016-07-20

@JPE

Hi,

We use [IGV](https://www.broadinstitute.org/igv/), which is developed at the Broad.

-Sheila

From everestial007 on 2016-08-09

I am getting some error while running variant filtration.

WARN 22:37:17,483 Interpreter – ![0,2]: ‘QD < 2.0;’ undefined variable QD

I have been trying to find the source of the error but no luck. Never had this kind of issues before with other vcf files when running a standard (best practices). Also, the QD field is indeed present in the raw.vcf file.

Any suggestions is appreciated.

Thanks,

From Sheila on 2016-08-09

@everestial007

Hi,

Have a look at my response in [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/comment/29074/#Comment_29074).

-Sheila

From everestial007 on 2016-08-10

Thanks @Sheila !

It really helped clear my concerns.

From Carolina_O on 2016-10-05

Hi, I’m new here. We are planning to make SNP calling in RNA seq data. We don’t have a reference genome, we want to compare SNP between different species with different ploidy levels.

We need to build a transcriptome reference or can we work with a related species?

Any suggestions is really appreciated.

Thank you very much.

From shlee on 2016-10-05

Hi @Carolina_O,

Our tools rely on reference-aligned sequences and calling variants against the reference. So depending on what type of variants you are interested in, e.g. those that vary from the transcriptome or those that vary between species, I will tentatively say that either should work for you. I suppose calling variants against the transcriptome may be simpler as alignment records would be kept minimal.

Your project sounds complicated and we have some forum threads that should be of interest to you, e.g. [here](http://gatkforums.broadinstitute.org/gatk/discussion/4501/correct-workflow-for-analyzing-multiple-populations-of-a-non-model-species), [here](http://gatkforums.broadinstitute.org/gatk/discussion/6461/comparing-the-populational-snv-calling-of-two-dominant-bacteria-species-candidates-from-a-freshwater) and [here](http://gatkforums.broadinstitute.org/gatk/discussion/comment/32331#Comment_32331). Also, see this [FAQ](https://software.broadinstitute.org/gatk/documentation/article.php?id=7363). I’m sure there are more discussions that would interest you and you should find them by searching the forum.

From Carolina_O on 2016-10-06

Thank you very much Shlee!

From vyellapa on 2016-10-06

Curious if the haplotype caller canbe replaced with Mutect to identify low VAF variants?

From Sheila on 2016-10-11

@vyellapa

Hi,

Have a look at [this thread](http://gatkforums.broadinstitute.org/dsde/discussion/5907/lower-gatks-haplotypecaller-threshold-for-allele-frequency-as-part-of-a-variant-calling-pipeline) which should help.

-Sheila

From jeales on 2016-10-17

Hi,

The STAR documentation strongly suggests supplying a GTF file of transcript annotations when building the genome index.

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Is there any reason that step 1 does not supply a human GTF?

I was hoping that if I supplied a GTF then the splice junctions used in the second pass would be more stable across multiple samples

thanks

James

From JPE on 2016-10-18

Is it possible to annotate the resulting VCF with gene name and RS number (for human samples)? Which part of the analysis needs tweaking for this, the STAR alignment? Can you please provide an example?

From ypnngaa on 2016-10-19

Hello,

If I am also interested in phasing of those SNPs that called from this pipeline, can I just enable —emitRefConfidence GVCF? Or I just use the exactly same pipeline, use ReadsBackedPhasing?

Thanks!

From Sheila on 2016-10-20

@JPE

Hi,

You can use VariantAnnotator with [—dbsnp](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php#—dbsnp), [—resource](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php#—resource), and [—snpEffFile](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php#—snpEffFile) after calling variants with HaplotypeCaller.

-Sheila

From Sheila on 2016-10-20

@ypnngaa

Hi,

Yes, the latest version of HaplotypeCaller does phasing automatically, either in normal mode or in GVCF mode :smile:

You only need to run ReadBackedPhasing if you want to merge MNPs.

-Sheila

From Geraldine_VdAuwera on 2016-10-24

@jeales I think this recommendation to use a GTF file came up after we wrote up our docs on this. Ultimately we defer to Alex Dobin for the most up to date way to run STAR since he is the author of that program. We’ll try to make that more clear in the docs.

From ypnngaa on 2016-10-27

Hi @Sheila ,

Thanks for replying. However, in my case, it did not automatically perform phasing. I used this command to run on my RNA Seq data:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $genomeFasta -I $Input_BAM -dontUseSoftClippedBases -stand_call_conf 0 -stand_emit_conf 0 -o $Output_VCF

In the output VCF, I did not see SNPs being phased. And I saw this message from GATK:

HaplotypeCaller – Disabling physical phasing, which is supported only for reference-model confidence output

Do you know why?

If I want it to do phasing, what should I use?(I know it is RNAseq based, phasing can be conducted just locally, but I need that.) Now I just used seperated steps: first use UnifiedGenotype, then ReadBackedPhase. Do you have any suggestion?

Thanks!

From Sheila on 2016-10-27

@ypnngaa

Hi,

Oh, I see. I’m sorry I mislead you. It seems phasing is only supported in GVCF mode. Right now, we recommend HaplotypeCaller in normal mode only for RNA-seq, because we have not done testing for the GVCF workflow. However, some users have reported using the GVCF workflow for RNA-seq data.

You can either try the GVCF workflow and thoroughly test your results, or run HaplotypeCaller then ReadBackedPhasing afterwards.

-Sheila

From J_R on 2017-01-16

Hi,

Yet another question, sorry. Is there more documentation unpacking why there’s an RNA-specific recommendation of hard-filtering clustered SNPs?

Thanks!

Joanna

From Sheila on 2017-01-18

@J_R

Hi Joanna,

I think [this article](https://software.broadinstitute.org/gatk/guide/article?id=3891) is all we have for now. There may be more to come in the future.

-Sheila

From mjtiv on 2017-01-20

Question about GATK 3.7 and RNA-Seq pipeline, is it safe to say the command “-stand_emit_conf 20.0” for STEP 6 has been removed from the pipeline commands? I got a warning this command was deprecated and so I removed it, otherwise the pipeline won’t run.

Just read your updates on the new version—— Which state you removed it…. My fault

From Geraldine_VdAuwera on 2017-01-21

We’ll update the article too, thanks for pointing this out.

From everestial007 on 2017-01-31

Geraldine_VdAuwera Sheila @shlee

Regarding the filtering of VCF from RNAseq using SNPcluster:

`java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter “FS > 30.0” -filterName QD -filter “QD < 2.0” -o output.vcf `

Is there a method or functionality to prepare the bed file based on this SNP cluster?

I would rather want to filter the reads from the bam file and run Denovo assembly and blast search to find other potential mapping areas for the reads on the genome.

Since, this problem addresses `paralogous alignment` in which one set of reads (mostly containing reference allele) will be true alignment while the other would be paralogs; I was thinking if reads can be filtered in following manner.

- select the site that have SNP clusters and prepare bed file.

- A. filter reads based on that bed file.

- B. Or, only select the reads that contains dominantly non-reference SNP cluster, suggesting that is the paralogous read. This way only paralogous alignment will be filtered.

From shlee on 2017-02-03

Hi @everestial007,

If by paralogous alignments you are referring to reads that can multiply map within a given genome, then I think low or zero mapping scores (MAPQ) in a region would be the basis for your analysis workflow. I have to say that it’s not clear to me how filtering on FisherStrand helps towards identifying regions with low/zero MAPQ reads. I do see that QualByDepth, being normalized by depth supporting a variant, could help but this metric also factors for quality. It’s not clear then how quality would correlate with regions with reads with low or zero MAPQ scores. I’m just learning about these annotation metrics so perhaps I’ve misunderstood your approach.

Can you please define what you mean by SNP cluster? We have some filters that target reads that are likely contaminant (GATK engine read filters [here](https://software.broadinstitute.org/gatk/documentation/tooldocs/)) as well as sites with clustered events (in MuTect2).

Let me answer the parts of your questions that I can.

Within Picard and GATK tools, we typically use interval lists more so than bed files. Here are some tools that can help you prepare an intervals list:

- Picard [VcfToIntervalList](https://broadinstitute.github.io/picard/command-line-overview.html#VcfToIntervalList)

- Picard [IntervalListTools](https://broadinstitute.github.io/picard/command-line-overview.html#IntervalListTools)

- Picard [BedToInteralList](https://broadinstitute.github.io/picard/command-line-overview.html#BedToIntervalList)

To convert an intervals list to a bed file, please see the awk command in [Tutorial#7156](http://gatkforums.broadinstitute.org/gatk/discussion/7156/howto-perform-local-realignment-around-indels), right before the start of section 3.

You might also be interested in Picard’s [MakeSitesOnlyVcf](https://broadinstitute.github.io/picard/command-line-overview.html#MakeSitesOnlyVcf), to simplfy the starting VCF. Also, it appears Picard’s [FilterVcf](https://broadinstitute.github.io/picard/command-line-overview.html#FilterVcf) may also do your FS/QD filtering.

All this being said, to subset out reads to filter reads based on sites within a VCF you really don’t need to create a bed/intervals file. You can simply supply the vcf with the `-L` parameter, e.g. `-L file.vcf`, in conjunction with a GATK tool, e.g. PrintReads. The `-L` option also accepts interval lists and bed files. You might want to double-check if all that is needed is overlap to subset reads or if you need to supply padding (`—interval_padding`) to cover the entirety of reads and perhaps change the merging approach (`—interval_set_rule`). These are defined in the GATK [engine parameters](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_engine_CommandLineGATK.php) that you can use with all GATK tools.

From shlee on 2017-02-03

P.S. @everestial007. I’ve just confirmed with our engineer that you can print out all reads that touch the interval/site. No need to expand intervals.

From mjtiv on 2017-02-10

In your RNA-Seq pipeline it is mentioned known issues may occur with HaplotypeCaller with different alleles that are far from the 0.5 ratio which will influence ASE studies. The suggestion is to use a “DNA-aware” mode of the caller to fix this issue. Could you specifically explain more about this “DNA-aware” mode and how to perform it in HaplotypeCaller or if necessary the entire pipeline? Thank You

From Geraldine_VdAuwera on 2017-02-13

@mjtiv What you’re referring to is a hypothetical feature we were considering but never had the opportunity to build. The idea was to use a DNA normal to control for estimating baseline expectations.

From Judy on 2017-03-27

I have the same problem with “MESSAGE: Error caching SAM record xxx, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.” during IndelRealigner, but I have tried adding command line -rf NotPrimaryAlignment , it doesn’t work. Is there anybody having some solutions?

From Geraldine_VdAuwera on 2017-03-27

@Judy please do NOT post the same comment multiple times. It clutters the forum and wastes our time.

Have you validated your bam file?

From kulbirsinghsandhu on 2017-04-19

Hi everyone, I am doing trying to find variants using my RNAseq data (it was generated for expression analysis). I had 8 samples which were indexed and sequenced in a single lane. I received 16 FASTQ paired end files from 8 samples after sequencing. These 8 samples are derived from 2 vareities and 2 conditions. I am interested in finding variants among these 2 plant varaities. The fastq files were trimmed for adaptor sequences and only paired sequences were retained by using Trimmomatic. My question is that during the first and second alignment step with STAR, can i use all the fastq files for alignment at the same time? This will get me one output sam file to take into next steps upto base recalibration. Or should I align each read pair separately and then go upto base recalibration steps separetly for each read pair, and merge before variant calling step? Please suggest. Thanks, Kulbir

From Sheila on 2017-04-24

@kulbirsinghsandhu

Hi Kulbir,

I think [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/6299/should-rna-seq-libraries-be-aligned-in-parallel-or-consecutively) will help.

-Sheila

From Yogesh on 2017-05-15

I am Trying to find SNP in the transcriptome. I have finished the Split’N‘Trim and reassign mapping qualities step. after this can I do indel realingment using these command: But sorted bam file is needed, can I sort the bam file using samtool:

/home/samtools/samtool sort dedup.bam >sorted.dedub.bam

Generate intervals of interest from sample alignments:

java -jar GenomeAnalysisTK.jar \

-T RealignerTargetCreator \

-nt 4 \

-R refgenome.fa \

-I **sorted.dedup.bam **\

-o realign.intervals

Realign (multiple sequence alignment):

java -jar GenomeAnalysisTK.jar \

-T IndelRealigner \

-R refgenome.fa\

-targetIntervals realign.intervals \

-I sample1.sorted.dedup.bam \

-o sample1.sorted.dedup.realigned.bam

Fix mate pair info in BAM (PICARD):

java -jar $PICARDDIR/picard.jar FixMateInformation \

INPUT=sample1.sorted.dedup.realigned.bam \

OUTPUT=sample1.sorted.dedup.realigned.fixmate.bam \

SO=coordinate \

CREATE_INDEX=true

From shlee on 2017-05-17

The above question is duplicated at .

From Govardhan on 2017-05-31

Hi,

I performed variant calling using RNA-seq in our cohort as we had poor quality exome seq data and thanks to GATK for this pipeline I identified commonly identified hot-spot mutations.

We are planning to include these findings in upcoming publication and I have question about cutting this pipeline.

Should I just cite three GATK publication??

Also, did anyone published anything using same pipeline??

Thanks in advance.

Cheers,

Gova

From Sheila on 2017-06-05

@Govardhan

Hi Gova,

We recommend citing both the Best Practices pipeline (from the GATK site) and one of the GATK papers.

-Sheila

From mjtiv on 2017-06-16

With variant calling step (HaplotypeCaller), how is strand direction taken into account? The reason I am asking is because I am trying to compare RNA-seq variant data produced using this pipeline to genotyping data that is always (+) strand.

From everestial007 on 2017-06-16

@mjtiv

I think the variants are always reported in context of the reference genome you supplied. The RNAseq transcripts may come from either + or – strand but, variants are always reported in context of the ref allele.

From everestial007 on 2017-06-17

@mjtiv

Hope you already know this now. But, just to verify that assertion I made earlier: https://groups.google.com/forum/#!topic/rna-star/ehHFgV_2cjU

See the response on similar problem, from `Alex Dobin` the author of STAR-Rna.

From lshepard on 2017-08-15

Hi, has this pipeline been tested with the HISAT2 aligner? If so, how did it compare to the results with STAR?

From Sheila on 2017-08-21

@lshepard

Hi,

According to [this article](https://software.broadinstitute.org/gatk/documentation/article.php?id=3891), “For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. “. You can find more details under “1. Mapping to the reference”

-Sheila

From lshepard on 2017-08-23

Sheila said: >

lshepard

> Hi,

> According to [this article](https://software.broadinstitute.org/gatk/documentation/article.php?id=3891), “For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. “. You can find more details under “1. Mapping to the reference“

> -Sheila

@Sheila

Yes, I noticed that, but I am it is not very clear to me when this note was added to the page as it was first created in 2014 when HISAT2 was not available yet. It says it was last updated in 2017, but were the newer aligners tested as well?

Thanks for the clarification!

From Sheila on 2017-08-24

@lshepard

Hi,

I am not sure if there has been any testing that recently. I will check with the team and get back to you.

Of course, STAR is what we recommend, but you are free to use whatever aligner you have determined is best for your needs :smile:

-Sheila

From lshepard on 2017-08-28

@Sheila

Thank you! I appreciate you check-in with the team!

From shlee on 2017-08-28

Hi @lshepard,

I believe testing was done between TopHat and Star and the decision to use Star had in part to do with its output of compatible metadata, e.g. CIGAR string representations.

From Geraldine_VdAuwera on 2017-08-31

That’s right, a comparison was made on several packages including TopHat and Star; criteria included sensitivity metrics and interoperability of results (well-behaved output format). Wrt sensitivity we found that Star gave the best overal results across SNPs and indels, though iirc other packages may have provided slightly better results over SNPs alone. That was several years ago though, and we have not reevaluated this since as RNAseq has not been a development priority. This is likely to change in the near future however.

From lshepard on 2017-09-01

@Geraldine_VdAuwera

Sounds good, thank you for giving the details of the analysis performed! I might do some of these comparisons myself in the near future.

From krdav on 2017-09-05

@lshepard

Somebody else has already done a similar comparisons that indeed does claim better results using TopHat2:

https://wellcomeopenresearch.org/articles/2-6/v2

From uff on 2017-09-05

Hi Geraldine, I am planning to do variant calling in RNA-seq data but haven’t prepared the sample library yet. Could you please give me some advice on the minimum sequencing depth required to find meaningful true variants?

From lshepard on 2017-09-05

krdav said: >

lshepard

> Somebody else has already done a similar comparisons that indeed does claim better results using TopHat2:

> https://wellcomeopenresearch.org/articles/2-6/v2

@krdav

Interesting, thanks for pointing it out. I will need to check the details, but if that is indeed the case then I would expect potentially better results with HISAT2. Thanks again!

From Sheila on 2017-09-06

@uff

Hi,

We don’t have any specific recommendations, but GATK tools are tested on 30X data.

-Sheila

From dsap on 2017-09-18

Hi Geraldine_VdAuwera and Sheila,

It has been several years since this article was originally was posted. You described this possibility: “We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously”

Is HaplotypeCaller yet capable of using WGS and RNA-seq data simultaneously to improve SNP/Indel calling?

Thanks!

Dan

From Sheila on 2017-09-21

@dsap

Hi Dan,

I don’t think there has been much done in that area, so it is not a good idea to run on both DNA and RNA together. Let me confirm with the team and get back to you.

-Sheila

From Geraldine_VdAuwera on 2017-09-21

Unfortunately we haven’t implemented that functionality, sorry.

From jgould on 2017-10-19

Is there a WDL available that implements this workflow?

From Sheila on 2017-10-24

@jgould

Hi,

I am checking with the team and will get back to you.

-Sheila

From Geraldine_VdAuwera on 2017-10-30

@jgould We currently don’t have a WDL for variant calling on RNAseq; it’s on the roadmap of workflows we aim to provide and support in early 2018.

From lidia on 2018-01-24

Hello GATK team I am new in the use of this toolkit and therefore I have some questions. I am working with RNA paired end read sequences. My goal is to find SNPs in 2 strains of trout to be able to compare them, from each strain I have 8 fish sampled, each fish with 6 different tissues. I already went through the best practices workflow and I have several questions about the STAR steps. My first steps are the following:

picard CreateSequenceDictionary R= genome.fna O= genome.dict

samtools faidx ReferenceGenome/genome.fna

STAR —runMode genomeGenerate —genomeDir GenomeDir –genomeFastaFiles ReferenceGenome/genome.fna —sjdbGTFfile ReferenceGenome/Annotations.gff —limitGenomeGenerateRAM 140000000000 —runThreadN 8

STAR —runMode alignReads —genomeDir GenomeDir —readFilesCommand zcat —readFilesIn Forelle/${R1} Forelle/${R2} —outFileNamePrefix ${sample}_ —runThreadN 8 /

I understand that I have to do the previous step for every tissue independently. Therefore I would be generating 96 SJ.out.tab files.

For the second pass how can I create the genome indexes if I have 96 files of spliced junctions? or do I have to repeat this process independently 96 times so that it will be like creating 96 genome directories?

STAR —runMode genomeGenerate —genomeDir GenomeDir2 —genomeFastaFiles ReferenceGenome/genome.fna —sjdbFileChrStartEnd ${sample}_SJ.out.tab —sjdbOverhang 64 —limitGenomeGenerateRAM 140000000000 —runThreadN 8 /

STAR —genomeDir GenomeDir2 —readFilesCommand zcat —readFilesIn Forelle/$R1 Forelle/$R2 —outFileNamePrefix ${sample}_2pass_ —runThreadN 8 /

Thanks!

From Sheila on 2018-01-29

@lidia

Hi,

I am not sure I understand what you mean by “For the second pass how can I create the genome indexes if I have 96 files of spliced junctions? or do I have to repeat this process independently 96 times so that it will be like creating 96 genome directories?” What do you mean by “genome indexes”? Are you aligning to one reference genome?

Thanks,

Sheila

From lidia on 2018-01-29

Hi Sheila,

Yes, I am aligning to one reference genome. Therefore, since I understand that the alignment has to be done tissue by tissue, at the end of the first pass I have 96 SJ.out.tab files. But I am unsure about how to include them for the second pass.

From Sheila on 2018-01-31

@lidia

Hi,

Ah, I see. Thanks for the clarification. We have an example of how to run the alignment step [here](https://software.broadinstitute.org/gatk/documentation/article?id=3891). I think you will have to re-run for 96 files. You may want to ask the STAR team for any other questions about the usage, as we cannot help much with that.

-Sheila

P.S. For your samples, are you trying to determine differences in tissues? I am assuming so. You may also find [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11310/about-the-haplotypes-in-hg19-fasta-and-the-gtf-file/p1) helpful.

From flynnchen on 2018-02-09

Hi GATK Team,

It seems like you guys didn’t use GTF annotations in the STAR index mapping. Would you guys recommend using GTF for mapping RNA to reference using STAR? Some of our co-workers suggested that it might be an option. Would that improve the variant calling quality?

Thanks

From bhaas on 2018-02-13

Is there an updated version of this protocol for use with gatk-4, as some of the options in gatk have changed.

From Sheila on 2018-02-14

@flynnchen

Hi,

I think [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11310/about-the-haplotypes-in-hg19-fasta-and-the-gtf-file) may be helpful.

-Sheila

From Sheila on 2018-02-14

@bhaas

Hi,

The Best Practices are in development for GATK4, so we don’t have full documentation for it yet.

-Sheila

From Vrushali on 2018-02-17

Hi,

The tools RealignerTargetCreator and IndelRealigner are obsolete in GATK4. Are there any replacements for these tools in GATK4?

Also, there are two option -BQSR (used with PrintReads) and -stand_emit_conf (used with HaplotypeCaller) that are not available with GATK4. What should be done in these cases?

From Sheila on 2018-02-20

@Vrushali

Hi,

Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/7131/is-indel-realignment-removed-from-gatk4). However, there are plans to port those tool now, and you can keep track of it [here](https://github.com/broadinstitute/gatk/issues/3104).

For BQSR, have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/46067#Comment_46067). For stand_emit_conf, have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=8692) to find out why it is gone.

-Sheila

From mjtiv on 2018-02-22

Question about pipeline, I just got into a conversation with a lab mate about using a gtf file in an RNA analysis pipelines and realized that GATK’s Best Practices for RNA-Seq does NOT appear to use a reference gtf file in the analysis (slight heart attack thinking need to re-run everything). Could you expand on this gtf free option? Also, I did look back at my notes and see I did use gtf file to help create the STAR genome index reference file (hopefully doesn’t mess things ups).

Any feedback on this topic would be greatly appreciated and fingers crossed its not bad news.

From Sheila on 2018-02-23

@mjtiv

Hi,

Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11310/about-the-haplotypes-in-hg19-fasta-and-the-gtf-file).

-Sheila

From lfall on 2018-05-10

Hello, I am trying to call SNPs in my RNA-seq data using the GATK pipeline, but I am running into the same error over and over. The alignments were done using STAR and I was able to make it through the workflow steps 1 (mapping to reference (STAR)) and two (Add read groups, sort, mark duplicates, create index (Picard)). When I get to the Split'N'Trim step, I get an error. I have tried changing and removing just about everything I can think of in the command but I still get this error saying that the argument is wrong. I've checked so many times on the GATK best practices to see if my scripts are wrong, but they seem to match what is written on the workflow. I'm very sorry if this question has an obvious answer, I have tried so many things and I can't seem to find the answer anywhere! Any help is very much appreciated :) Thank you!

This is the script I am running:

!/bin/bash

SBATCH -N 1

SBATCH --ntasks 2

SBATCH --mem 24G

SBATCH --mail-type ALL

SBATCH -J split.n.trim

module load GATK/3.8-0-Java-1.8.0121 module load Java/1.8.0152 module load R/3.2.5-IGB-gcc-4.9.4 module load Python/2.7.13-IGB-gcc-4.9.4

java -jar GenomeAnalysisTK.jar \ -T SplitNCigarReads \ -R /home/a-m/lfall/Fusariumrnaseq/data/genome/NCBI/GCF000240135.3ASM24013v3genomic.fna \ -I BHS.dedupped.bam \ -o BHS.split.bam \ -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOWNCIGAR_READS

This is the error I get after running this script:

[lfall@compute-5-0 rg.duplicates.index]$ more slurm-622962.out To execute GATK run: java -jar $EBROOTGATK/GenomeAnalysisTK.jar

The following have been reloaded with a version change: 1) Java/1.8.0121 => Java/1.8.0152

The following have been reloaded with a version change: 1) Java/1.8.0152 => Java/1.8.0121

INFO 16:00:00,666 HelpFormatter - ---------------------------------------------------------------------------------- INFO 16:00:00,685 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 INFO 16:00:00,685 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 16:00:00,685 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 16:00:00,685 HelpFormatter - [Thu May 10 16:00:00 CDT 2018] Executing on Linux 3.10.0-693.11.1.el7.x8664 amd64 INFO 16:00:00,685 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0121-b13 INFO 16:00:00,688 HelpFormatter - Program Args: -T SplitNCigarReads -R /home/a-m/lfall/Fusariumrnaseq/data/genome/NCBI/GCF000240135.3ASM24013v3geno mic.fna -I BHS.dedupped.bam -o BHS.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOWNCIGARREADS INFO 16:00:00,694 HelpFormatter - Executing as lfall@compute-0-13 on Linux 3.10.0-693.11.1.el7.x8664 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.012 1-b13. INFO 16:00:00,699 HelpFormatter - Date/Time: 2018/05/10 16:00:00 INFO 16:00:00,699 HelpFormatter - ---------------------------------------------------------------------------------- INFO 16:00:00,699 HelpFormatter - ---------------------------------------------------------------------------------- ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/a-m/lfall/Fusariumrnaseq/r esults/NCBI_GATK/start.from.sam/rg.duplicates.index/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console... INFO 16:00:00,997 GenomeAnalysisEngine - Deflater: IntelDeflater INFO 16:00:00,997 GenomeAnalysisEngine - Inflater: IntelInflater INFO 16:00:00,997 GenomeAnalysisEngine - Strictness is SILENT

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://software.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Invalid command line: Failed to load reference dictionary

ERROR ------------------------------------------------------------------------------------------

From shlee on 2018-05-11

Hi @lfall,

The message:

> ERROR MESSAGE: Invalid command line: Failed to load reference dictionary

Indicates the tool is unable to access the dictionary associated with the reference. Your reference is

> -R /home/a-m/lfall/Fusarium_rnaseq/data/genome/NCBI/GCF_000240135.3_ASM24013v3_genomic.fna

So here are the things I would try:

1. Generate a reference dictionary with [Picard CreateSequenceDictionary](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_CreateSequenceDictionary.php) in the same directory as the reference.

2. It’s possible the tool is having an issue with the `.fna` extension of your reference. Typical reference FASTA extensions are `.fasta` or `.fa`. Try switching this out, assuming your reference is actually in simple FASTA format. Otherwise, try to convert your FNA file to FASTA.

From kibrahim90 on 2018-05-14

after creating an index with only 1.sj.out file. I realized that my sam files are missing columns. How can I create the index with multiple sj.out.tab files ?

From shlee on 2018-05-14

Hi @kibrahim90, it’s not clear what you are asking. Perhaps you will find SAM specifications [here](https://samtools.github.io/hts-specs/) helpful.

From kibrahim90 on 2018-05-14

shlee said: > Hi

kibrahim90, it’s not clear what you are asking. Perhaps you will find SAM specifications [here](https://samtools.github.io/hts-specs/) helpful.

how do I run star (aligner) to create an index with multiple sj.out.tab files is what im asking. Its not in the manual. I checked.

From shlee on 2018-05-14

@kibrahim90, Star is not a GATK/Picard tool and therefore we do not support it. Please try posting your question to Biostars or SeqAnswers.

From CarolineJ on 2018-07-25

Hi team –

Do you have any suggestions for the variant calling step (#6) and the hard filtering step (#7) for multisample RNAseq experiment aligned to a de novo transcriptome? Also, is there a still a benefit to using HaplotypeCaller over UnifiedGenotyper for cases where the reference is a de novo assembled transcriptome?

6. GATK best practices for variant calling for a single sample/align to ref. genome

```

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

```

My “best guess” at customizing GATK best practices for a multisample experiment with alignment to a de novo transcriptome

```

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R de.novo.transcriptome.fasta -I input_sample1.bam input_sample2 -stand_call_conf 20.0 -o output.vcf

```

7. GATK best practices for hard filtering

```

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R de.novo.transcriptome.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter “FS > 30.0” -filterName QD -filter “QD < 2.0” -o output.vcf

```

My best guess is that the hard filtering settings would be identical for a multisample/denovo transcriptome ref design.

From Sheila on 2018-08-03

@CarolineJ

Hi,

Unfortunately, our team has not dedicated much time to the RNA pipeline, but we do have some recommendations in the github repo [here](https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels). I think the team is now working on better Best Practices for RNA-seq, so keep an eye out for those.

-Sheila

From XiaoshenYin on 2018-11-10

Hello,

When I run the command

mkdir runDir

cd runDir

STAR —genomeDir /scratch/snyder/y/yin168/gatk/test/genomeDir —readFilesIn /scratch/snyder/y/yin168/gatk/test/400_R1.fastq /scratch/snyder/y/yin168/gatk/test/400_R2.fastq —runThreadN 5

I always got an error:

/var/spool/torque/mom_priv/jobs/1411989.snyder-adm.rcac.purdue.edu.SC: line 18: 34161 Segmentation fault

(sometimes it is line 16, instead of line 18)

How should I resolve this issue?

Thanks a lot!

From XiaoshenYin on 2018-11-13

Hello,

In the gatk RNAseq variant calling pipeline, there is a step for the 2-pass STAR as below:

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass mkdir $genomeDir STAR —runMode genomeGenerate —genomeDir $genomeDir —genomeFastaFiles hg19.fa —sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab —sjdbOverhang 75 —runThreadN

Here, —sjdbOverhang is set to 75. However, the default value of —sjdbOverhang is 100. Why is it set to 75 instead of 100? Does that make a difference? What value should I set for my analysis.

Thanks a lot!

From XiaoshenYin on 2018-11-21

Hello,

When running the gatk RNAseq variant calling pipeline, I am not quite sure about whether I should spend time doing 4. Indel Realignment and 5. Base Recalibration.

For 4 – How significant the effect would be if I ignore this step? It says “we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels)”. I am calling SNPs from RNAseq, so how should I evaluate/determine whether it is important not to miss any potential indels? Could you list potential cases in which it is important not to miss any potential indels (or in which it is okay to miss potential indels)?

Also, I see a post saying that, if a haplotype assembly is done (for example, run HaplotypeCaller), then it is not necessary to run Indel Realignment. Is this true? Since I am going to do 6. Variant calling by running HaplotypeCaller, does this mean I do not need to run Indel Realignment at all?

For 5 – How big/marginal the effect would be if I just skip 5. Base Recalibration? When would the qualities have systematic error modes, as mentioned in the explanation for this step? Also, since —known-sites is a parameter I would need to provide in Base Recalibration, but I do not have a “known sites” resource file. Does it still make sense to do this step? If so, how should I generate/prepare a “known sites” resource file?

Thanks.

From XiaoshenYin on 2018-11-21

Hello,

I use the command to filter called variants according to the pipeline:

java -jar ~/bin/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T VariantFiltration -R ../sl_genome/gPmar100.fa.txt -V 400.vcf -window 35 -cluster 3 -filterName FS -filter “FS > 30.0” -filterName QD -filter “QD < 2.0” -o 400_filter_5.vcf —setFilteredGtToNocall

In the filter column, I see “PASS”, “snpcluster” and “QD”. I would assume “QD” indicates that a specific locus has a poor call (i.e. with QD<2.0) and thus this variant should be filtered. Even though I add “—setFilteredGtToNocall”, I still see that the genotype is “0/1”. However, the genotype of this variant should be “./.” because I use “—setFilteredGtToNocall”. I try different options even with the latest gatk 4, but it still does not work. How can I make —setFilteredGtToNocall work and filter out the poor variants or set their genotypes to ./.?

Also, what is the difference between gatk-package-4.0.11.0-local.jar and gatk-package-4.0.11.0-spark.jar? The gatk-package-4.0.11.0-spark.jar does not work on my cluster, so is it okay if I use gatk-package-4.0.11.0-local.jar?

Thanks.

From XiaoshenYin on 2018-12-12

Hello,

I am calling variants from RNAseq data using this gatk pipeline. I did what the pipeline suggest exactly and did not set ploidy for HaplotypeCaller. I got both “.” and “./.” in the vcf file. “./.” stands for a diploid site at which the genotype cannot be called and “.” stands for a haploid site at which the genotype cannot be called. Why do I get both missing haploid and diploid sites? To be more specific, why do I get “.”, haploid site at which the genotype cannot be called? Should I keep or remove “.” in analysis?

Thanks.

From TomWillDo on 2019-01-03

Hi there,

Sorry if this has been answered previously, but I can’t seem to find it. Previous SNP/Indel hard filters (to serve in place of VQSR) include filters for SOR, but they seem to be missing from the RNA-seq pipeline. If someone could point me towards why they haven’t been included, I would be appreciative!

From hamaor on 2019-01-07

Thanks a lot for this guide!

I used this pipeline to analyze 6 samples of RNA-Seq (3 replicate of treatment and 3 replicate of control).

after generating the filtered vcf, I’m using the GenotypeGVCFs to join treatment replicates together and control replicates together. (and I might want to join the whole dataset).

Technically, 3 vcf files each 1.7G was joined into single 24MB file.

is that reduction normal for the algorithm? can I make any validation to make sure I’m on the safe side?

thanks

Maor

From encoded79 on 2019-01-23

Hello there

I used your mutation analysis pipeline for RNA-seq data. However, when i compare the BAMs , pre and post CIGAR hard clipping on the iGV , i still see overlapping reads for intronic regions. Basically, no clipping seems to have been done. This is also reflected on the VCF files, where nearly half of variants are in fact intronic. I appreciate it if you could elaborate on the possible reasons for this. Thanks a lot

From encoded79 on 2019-01-28

From XiaoshenYin on 2019-02-06

Hello @shlee ,

Step 3 in the pipeline for calling variants from RNAseq recommends:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

However, when I try gatk4.0, I did not find “-rf ReassignOneMappingQuality -RMQF 255 -RMQT” or “-U ALLOW_N_CIGAR_READS” in optional arguments, although they are present in gatk3.8. Is it because updates on gatk4.0 make it unnecessary to set those two parameters? In this case, can I use gatk 4.0 without setting those two arguments to run the step of Split’N‘Trim and reassign mapping qualities?

Thanks.

From seven32 on 2019-03-10

What is the correct syntax for calling HaplotypeCaller for RNA-seq data for GATK 4?

From AlFatlawi on 2019-05-06

@Geraldine_VdAuwera

Hi, regarding your comment in Oct 2014,

“At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.”

Please, is there any update in GATK 4?

I have many samples (RNA Seq) and I would like to merge them in a VCF file.

Thanks in advance.

Ali

From AlFatlawi on 2019-05-06

Hi,

At 7. Variant filtering:

I see you recommended the following:

—filterExpression “FS > 30.0” -filterName QD —filterExpression “QD < 2.0”

Please, are these for SNPs, Indel or both?

and, what is your opinion if I use the same filterexpression on both SNPs and Indel?

Regards

Ali

From AlFatlawi on 2019-05-13

any answer, please!

From zjwang on 2019-08-23

XiaoshenYin said: > Hello

shlee ,

> Step 3 in the pipeline for calling variants from RNAseq recommends:

> java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

> However, when I try gatk4.0, I did not find “-rf ReassignOneMappingQuality -RMQF 255 -RMQT” or “-U ALLOW_N_CIGAR_READS” in optional arguments, although they are present in gatk3.8. Is it because updates on gatk4.0 make it unnecessary to set those two parameters? In this case, can I use gatk 4.0 without setting those two arguments to run the step of Split’N‘Trim and reassign mapping qualities?

> Thanks.

Hello, I have the same question as you, Have you solved this problem? Why do not find “-rf ReassignOneMappingQuality -RMQF 255 -RMQT” or “-U ALLOW_N_CIGAR_READS” in optional arguments

From XiaoshenYin on 2019-10-31

Hello, how should I properly cite the pipeline of Calling variants in RNAseq in a manuscript?

From XiaoshenYin on 2019-11-05

@zjwang Hello, I ended up using the older version, if I remember correctly.

Report abuse