041. Offlabel workflow to simply call differences in two samples

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by shlee

on 2018-01-29

Given my years as a biochemist, if given two samples to compare, my first impulse is to want to know what are the functional differences, i.e. differences in proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.

First, call on each sample using Mutect2's tumor-only mode.

``` gatk Mutect2 \ -R ref.fa \ -I A.bam \ -tumor A \ -O A.vcf

gatk Mutect2 \ -R ref.fa \ -I B.bam \ -tumor B \ -O B.vcf ```

Second, for each single-sample VCF, move the sample-level AF allele-fraction annotation to the INFO field and simplify to a sites-only VCF.

This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.

Before:

After:

Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.

``` gatk Mutect2 \ -R ref.fa \ -I A.bam \ -tumor A \ --germline-resource Baf.vcf \ --af-of-alleles-not-in-resource 0 \ --max-population-af 0 \ -pon pon_maskAB.vcf \ -O A-B.vcf

gatk Mutect2 \ -R ref.fa \ -I B.bam \ -tumor B \ --germline-resource Aaf.vcf \ --af-of-alleles-not-in-resource 0 \ --max-population-af 0 \ -pon pon_maskAB.vcf \ -O B-A.vcf ```

Provide the matched single-sample callset for the case sample with the --germline-resource argument.
Avoid calling any allele in the --germline-resource by setting --max-population-af to zero.
Maximize the probability of calling any differing allele by setting --af-of-alleles-not-in-resource to zero.
Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) in which confident variant sites for both sample A and B have been removed, e.g. with gatk SelectVariants –V pon.vcf -XL AandB_haplotypecaller.vcf –O pon_maskAB.vcf.

Fourth, filter out unlikely calls with FilterMutectCalls.

``` gatk FilterMutectCalls \ -V A-B.vcf \ -O A-B-filter.vcf

gatk FilterMutectCalls \ -V B-A.vcf \ -O B-A-filter.vcf ```

FilterMutectCalls provides many filters, e.g. that account for low base quality, for events that are clustered, for low mapping quality and for short-tandem-repeat contractions. Of the filters, let's consider the multiallelic filter. It discounts sites with more than two variant alleles that pass the tumor LOD threshold.

We assume case sample variant sites will have a maximum of one allele that is different from the --germline-resource control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s --max-population-af to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites.
If multiple variant alleles are expected per call, then FilterMutectCall’s multiallelic filtering will be undesirable. For example, if changes to allele fractions for alleles that are shared was of interest for the two samples derived from the same parental line, and Mutect2 --max-population-af was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.

This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.

We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe the heterozygosity of proteins we express contributes to our health. If this is true, then for somatic studies, if cataloging the gain of alleles is of interest, then cataloging the loss of alleles should also be of interest. Can we assume just because variants are germline that they do not play a role in disease processes? How can we account for the combinatorial effects of the diploid nature of our genomes?

Remember regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. It may be that as one parental chromosome copy is lost, the other is duplicated to maintain copy number, which presumably compensates for dosage effects as is the case in uniparental isodisomy.

Updated on 2018-02-01

From JmeAlena on 2018-02-07

This article is awesome! I do have a question now that I read through this a bit. I guess I am just not fully understanding the —germline-resource flag. Based upon the commands in this article, where is the Baf.vcf or Aaf.vcf being generated? I understand it’s purpose and function but I just don’t how to accomplish this on the commandline.

From shlee on 2018-02-08

Hi @JmeAlena,

Thank you!

I do some hackery to reformat the data into these files. I'm pretty sure there are better ways so I left these details out.

What I do involves Excel and is as follows. Please do share any better approaches.

[1] Convert VCF data to table format using VariantsToTable.

gatk VariantsToTable \ -V A.vcf \ -F CHROM -F POS -F ID -F REF -F ALT -F TLOD \ -GF GT -GF AF \ -O A.table

[2] Modify A.table in Excel, e.g. using the CONCATENATE function:

Don't forget to add the # in front of CHROM. Copy-paste column in-column as values, then remove/modify columns to make a VCF format file. Copy-paste contents to a text editor, e.g. that has preferences set to remove trailing white spaces, and save as text file Aaf.table.

[3] Modify the VCF header.

Subset out the header then use nano to modify.

cat A.vcf | grep '##' > A.header nano A.header

Manually change (i) FORMAT-->INFOand (ii) use CTRL+k to delete unnecessary lines, etc.

[4] Concatenate header and body.

cat A.header Aaf.table > Aaf.vcf

[5] Validate file.

gatk ValidateVariants -V Aaf.vcf

From JmeAlena on 2018-02-22

Awesome! I am going to give it a try and I’ll let you know how it goes. Thank you!

From JmeAlena on 2018-02-24

Hi! Sorry to bother you all again but I am setting up this pipeline to work for my data and I was curious if the panel of normals (PoN) file was necessary to run the program.

To recap what I am doing….

I am using clonal plant lineages to detect accumulation of somatic mutations as discussed in [this forum](https://gatkforums.broadinstitute.org/gatk/discussion/11182/convert-combinevariant-output-to-individual-vcf-files#latest “this forum”). I want to basically obtain any unique SNVs among 12 samples of cloned plants exposed to differing selective pressures. So for example, a SNV might be present in some but not all. I technically have a control but mutation accumulation cannot be ruled out in these samples and must be analyzed as well.

I was curious if I do not have a model organism with known or confident sites, should I provide a PoN or can I leave this flag out? It appears that this flag has been created for those doing cancer research and I am not sure if it is applicable to my research. If it is required, how would you recommend I build and filter this type of vcf in my case?

With further consideration and thought, I have decided to take your advice one building the —germline-resource vcf but then analyzing the control samples with a treatment —germline-resource so that I get individual outputs for all files (I know it may seem a little funky)….

“If you follow the Mutect2 tutorial sans a matched normal BAM, then for the —germline-resource you can provide a callset representing the control group. This germline resource can use actual allele frequencies representing the control group samples. You would analyze each of the test group’s samples against the germline resource in independent Mutect2 runs.”

That all being said, how would this affect the creation or requirement of the PoN vcf? And how do the germline resource and PoN differ in the outcome?

From shlee on 2018-02-27

No bother at all. Your questions are what the forum is for, and you should know that your courage to ask helps others who read.

> curious if the panel of normals (PoN) file was necessary to run the program.

Technically, the tool does not require it.

Scientifically, I should think you would want it. Please see the paragraph that starts with “The panel of normals helps filter systematic artifacts of sequencing.” in [this document](https://gatkforums.broadinstitute.org/dsde/discussion/11127/) as to why. I imagine it would be easy for you to construct a panel from your current samples. Remember the PoN retains variants present in two or more samples as called by Mutect2. If you think the chances of a mutation arising in the same site in two samples is somehow higher for your case, then you can revert back to making the PoN the GATK3 way, which allows you to set the number of samples the retained sites must share, e.g. three or four instead of two.

To reiterate, any rare mutation analysis involving short read sequencing technology should be keen on incorporating a panel of normals to exclude artifacts of sequencing. The organism at hand should not be a factor in this decision. Actually, we know GRCh38 is one of the better reference genomes available, where the reference was tweaked to minimize uneven mapping and other empirical artifacts. I do not know the state of the quality of other reference genomes, but if they are less than GRCh38, then data mapped to such genomes, assuming similar quality of reads, benefit even more from using a PoN.

[Section 2 of Tutorial#11136](https://software.broadinstitute.org/gatk/documentation/article?id=11136#2) outlines the two steps to create a PoN using GATK4. [Section 1 of Tutorial#9183](https://gatkforums.broadinstitute.org/gatk/discussion/9183) outlines the steps to create a PoN using GATK3, which is where you can change the minimum number of samples a variant is observed in to be included in the PoN.

> I was curious if I do not have a model organism with known or confident sites, should I provide a PoN or can I leave this flag out? It appears that this flag has been created for those doing cancer research and I am not sure if it is applicable to my research. If it is required, how would you recommend I build and filter this type of vcf in my case?

Remember that germline calling with HaplotypeCaller has different criteria for passing calls than somatic calling with Mutect2. You can run HaplotypeCaller on your cohort of samples and filter using recommended filtering parameters (hard filtering or VQSR) for confident germline variant sites. If you have a database of commonly germline variant sites for your organism, then you should definitely take advantage of it towards determining which of your cohort calls are confident (using VQSR). The germline callset will differ from the somatic calls you get with Mutect2, given somatic calling must account for fraction purity and heterogeneity common in cancer samples. Here you leverage the germline callset towards assessing the cutoff for variant inclusion in the PoN.

> With further consideration and thought, I have decided to take your advice one building the —germline-resource vcf but then analyzing the control samples with a treatment —germline-resource so that I get individual outputs for all files (I know it may seem a little funky)…. That all being said, how would this affect the creation or requirement of the PoN vcf? And how do the germline resource and PoN differ in the outcome?

You use the `—germline-resource` argument to provide the matched sample’s Mutect2-called variant sites against which you want to contrast variant alleles. This will call all contrastive variants, some of which are likely artifacts. The PoN allows you to filter the artifactual sites. In the tutorial PoN, I do a hand-wavy approximation of masking germline sites in the PoN. I should think the resolution of this is dependent on how you determine which sites are truely germline variant sites. Towards this, the GATK offers [VQSR](https://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr/p1) (recommended), [hard filtering guidelines](https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set/p1) and soon (in beta) [a deep learning approach to variant filtering](https://software.broadinstitute.org/gatk/blog?id=10996) called GATK4 CNN.

I hope this clarifies things for you.

P.S. I have numbers to base my above comments on. The PoN filters a surprising amount of calls and I think it more likely these are artifacts than mutations of interest. But don’t just take my word for it. You should do a comparison +/-PoN for your project. Try it out on some small data and see what is the difference.

From JmeAlena on 2018-03-01

To be sure that I am reading this correctly, are you saying that I should use variant recalibration on my first pass vcf files from Mutect2 and use the outputs of that to create my PoN for a non-model organism? Or can the CreateSomaticPanelofNormals tool calculate artifactual and germline variants on its own, where I could use the first pass vcf files from Mutect2 to directly create my PoN?

Also, what do you mean by “In the tutorial PoN, I do a hand-wavy approximation of masking germline sites in the PoN.”

From sergey_ko13 on 2018-03-14

Is there any argument against using vcf from first step as -L for the third step for the same sample?

Thank You!

From shlee on 2018-03-14

@JmeAlena,

GATK’s variant recalibration workflows are meant for germline callsets, e.g. those made with HaplotypeCaller. You should not run your Mutect2 results through these as the annotations in the callsets differ and those that germline recalibration relies on are absent in Mutect2 calls generated with the recommended default arguments.

CreateSomaticPanelOfNormals is a straightforward tool that simply retains sites where variants are called in two or more samples.

For the off-label tutorial, I remove sites in the PoN that are confidently called in sample A and sample B by GenotypeGVCFs. Remember that GenotypeGVCFs calls variants on per-sample GVCFs produced by HaplotypeCaller’s `-ERC GVCF` mode. Then, for the masking, I use this callset plus a five base mask extension to create the `pon_maskAB.vcf` masked-PoN. I say this is a hand-wavy approximation because (i) I relied on the cutoffs in HaplotypeCaller/GenotypeGVCF to determine passing sites without any fancy population-based filtering like VQSR nor hard-filtering, and (ii) I arbitrarily selected a five base mask without any additional sanity checks or validation to tune appropriateness. I made the motions of masking sites without really checking to see the effects of the approach on sensitivity. To obtain the most sensitive results, albeit with many artifactual calls, omit the PoN altogether.

One thing I wanted to let you know is that our developers have decided to allow Mutect2 to emit germline calls for the normal sample. Code changes are at https://github.com/broadinstitute/gatk/pull/4522, and so these changes will show up in a near-future release of GATK4.

—-

@sergey_ko13,

The `-L` options should be used cautiously with any of our assembly-based callers, including Mutect2. You need to be sure to provide the caller sufficient padding around the sites of interest, e.g. with the `-ip` option. I do not recommend using small intervals like those from variant calls alone. For consistent and reproducible calling, it’s best to allow the tool to traverse the entirety of contigs.

From JmeAlena on 2018-03-15

@shlee

Thank you! I am greatly appreciative of the level of help you have provided on this topic!

I am going to give PoN a try and compare it to calls without PoN. You mention in a previous comment that hard filtering or VQSR can be used. I have a set of calls from HaplotypeCaller, not run on genotype mode and filtered them based upon the recommendations of this [article](https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set/p1 “article”). Is that acceptable or is genotype mode required? I just want to make sure that I am distinguishing between required actions and suggested options.

I have one other question. You mentioned using CombineVariants for the —germline-resource file but CombineVariants is no longer available as a tool in GATK4. Would CatVariants be more appropriate or are there any necessary flags that should be used for Mutect2 to appropriately read the —germline-resource file. I have some loci indicate “set=intersection” or “set=variant” in the info field after using CombineVariants from an older version of gatk and wanted to ensure that this is not problematic for downstream analyses. For example, should I use…

-genotypeMergeOptions UNIQUIFY OR —variant:foo input1.vcf \ —variant:bar input2.vcf \ -genotypeMergeOptions PRIORITIZE \ -priority foo,bar

From shlee on 2018-03-16

Hi @JmeAlena,

I meant to mention Mutect2 has the `—genotype-pon-sites` to include variant sites that are coincident with those in the PoN in the callset, even if they will eventually be filtered.

Can you clarify what you mean by genotype mode? Do you mean the `-ERC GVCF` mode? I make no suggestions nor list any requirements. Remember that this workflow that I’ve proposed is an off-label workflow for you to use at your own risk. It is not an officially supported workflow.

At this point, I think we should involve others in the discussion. I think @davidben would love to hear about your research aims and how this workflow helps you towards them.

It is in the process of being carried over to GATK4. Until this port is complete, you can use CombineVariants from GATK3. I recommend using that from v3.7. You should decide whether additional annotations are useful to you. It is possible to omit the `set` annotations by setting `—setKey null`. Finally, the tool does not require you set any `—genotypemergeoption`. If and how you set this is up to you and depends on what you are combining and how you want variants prioritized.

From JmeAlena on 2018-03-19

Ok great! So far everything is working well, thanks to your advice. If I run into downstream issues, I will let you know.

From shlee on 2018-03-20

Great to hear @JmeAlena.

From JmeAlena on 2018-04-11

Hi @shlee

I had some concerns about my outputs this off-label workflow. We have talked about my experiment a little in previous comments and threads. This may be more appropriate for other threads but I figured I would start here since I used this workflow to generate results.

Mutect2 is calling low frequency variants for me, which is fantastic! However, I re-ran mutect2 with -bamout to visualize my mutations in IGV and how they compare to other samples. My concern is that a mutation will be listed in one sample but not others but I can see in the bam that they do actually exist in the other bam files. It is important to my research to have confidence that it can distinguish novel mutations.

How concerning is this? Is this related to the calculation of uninformative reads discussed in this [article](https://software.broadinstitute.org/gatk/documentation/article.php?id=6005 “article”)? Is there a way around this or what do you recommend?

From shlee on 2018-04-12

Remember @JmeAlena, that the bamout also contains artificial haplotypes. It’s hard to comment on what may be going on without seeing the alignments. Would it be possible for you to post IGV screenshots of what you are concerned about?

From JmeAlena on 2018-04-13

![](https://us.v-cdn.net/5019796/uploads/editor/dv/cnnjdybz8fl3.png “”)

So it appears that the mapping of some regions occur in some samples but not others and this may be an effect of dealing with RNAseq. The combined VCF is at the bottom under the individual bamouts from mutect2. Shown in this screenshot the Mutect2 outputs indicate that only three samples have the mutation but as you can see an additional sample also contains that variant (I arbitrarily chose a locus, I can provide more screenshots if needed).

I understand that some may be considered artifactual but with RNAseq could this be attributed to low depth if it is reporting the same alternate allele? With this being said, I cannot guarantee that the variant doesn’t exist in other samples and this limits my confidence in the somatic mutations output by Mutect2.

I understand that not all of these questions may not be your expertise but if there is anything I could have done or could do to prevent this or alleviate the concern from this please let me know.

From Sheila on 2018-04-18

@JmeAlena

Hi,

To confirm, the variants shown in the 4 BAM files are in the bamouts? Then, in the final VCF, only 3 out of 4 samples showing variants in the bamouts are called as variant?

Thanks,

Sheila

From JmeAlena on 2018-04-18

Yes that is correct.

From JmeAlena on 2018-04-18

At first I thought that maybe these were occurring at lower frequencies than what Mutect2 could detect without considering it artifactual but in many cases they are all occurring at approximately the same depth.

From pearlpearl on 2018-04-19

Hi @shlee

This is a very nice post. I am a little bit confused about the pon file here

For the command you use here,

gatk SelectVariants –V pon.vcf -XL AandB_haplotypecaller.vcf –O pon_maskAB.vcf

if this is a two sample compare study, what sample you use to create this “pon.vcf”, is it from one of the sample or both?

Also could you please give more explanations on the five base mask extension?

does that mean when you feed the haplotypecaller.vcf to gatk SelectVariants, you extend the haplotypecaller.vcf region to five bases in both direction?

Thanks

From pearlpearl on 2018-04-19

Hi I used my own way provided below to incoprated this step to the whole workflow:

[2] Modify A.table in Excel, e.g. using the CONCATENATE function and add the # in front of CHROM. newline="#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" sed "1c $newline" $A.table > $A.new.table awk 'BEGIN{OFS="\t"}{if($1!~/#/) {$8="TLOD="$6";AF="$8;$6=".";$7="."}{print $0} }' $A.new.table > $A.af.table [3] Modify the VCF header. Subset out the header then use nano to modify. change (i) FORMAT-->INFOand (ii) use CTRL+k to delete unnecessary lines, etc. cat $A.vcf |grep '##' |sed 's/##FORMAT/##INFO/g'> $A.header

[4] Concatenate header and body. cat $A.header $A.af.table > $A.af.vcf

[5] Validate file and index vcf file. gatk ValidateVariants -V Aaf.vcf gatk --java-options "-Xmx40g -Djava.io.tmpdir=$TempDir" ValidateVariants -V $A.af.vcf

@shlee said: Hi @JmeAlena,

Thank you!

I do some hackery to reformat the data into these files. I'm pretty sure there are better ways so I left these details out.

What I do involves Excel and is as follows. Please do share any better approaches.

[1] Convert VCF data to table format using VariantsToTable.

gatk VariantsToTable \ -V A.vcf \ -F CHROM -F POS -F ID -F REF -F ALT -F TLOD \ -GF GT -GF AF \ -O A.table

[2] Modify A.table in Excel, e.g. using the CONCATENATE function:

[3] Modify the VCF header.

Subset out the header then use nano to modify.

cat A.vcf | grep '##' > A.header nano A.header

Manually change (i) FORMAT-->INFOand (ii) use CTRL+k to delete unnecessary lines, etc.

[4] Concatenate header and body.

cat A.header Aaf.table > Aaf.vcf

[5] Validate file.

gatk ValidateVariants -V Aaf.vcf

From Sheila on 2018-04-23

@pearlpearl

Hi,

>if this is a two sample compare study, what sample you use to create this “pon.vcf”, is it from one of the sample or both? Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11053).

I will ask Soo Hee to get back to you about your other question.

-Sheila

From JmeAlena on 2018-04-24

Any updates about my question above?

From shlee on 2018-04-25

Hi @JmeAlena,

Thanks for your patience. Here are my thoughts.

First, in combining your M2 callsets, did you perchance remove filtered calls? It would be good to know whether M2 called a variant for your S6 sample that was subsequently filtered.

Second, remember that Mutect2/FilterMutectCalls parameters are tuned for DNA-Seq, not RNA-Seq, and also are meant to be stringent. So perhaps you should consider tuning the parameters to better suit your RNA data type. There are three stages to consider: (i) upfront read filtering, (ii) assembly machinery and (iii) hard-filtering based on annotation values. I noticed from your screenshot that the coverage for S6 is much lower (30 or 50, hard to tell with resolution) than S7-S9 samples (each >100). However, depth isn’t the only factor in calling confident variants. The thing to do is find out at what stage your S6 sample variant is missed, figure out why it was missed (IGV will help here) and see if tuning parameters allows for it to be called, e.g. try `—disable-tool-default-read-filters`, and decide if the new setting is appropriate for your entire analysis, etc.

Third, in studying BAMOUTS, folks typically compare it side-by-side to the RAW BAM, i.e. the BAM without reassembly.

From shlee on 2018-04-25

Hi @pearlpearl,

Thank you for contributing sed and awk commands. These are great.

I just used the forty-sample PoN from the [Mutect2 tutorial](https://gatkforums.broadinstitute.org/gatk/discussion/11136). It is important to account for sequencing artifacts, which PoNs do very well even if limited in cohort size.

As for the five-base mask extension, looking back on my notes, I see that I use `—mask-extension 5` with GATK’s [VariantFiltration](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php) tool. The command looks something like the following:

```

gatk VariantFiltration \

-V m2pon.vcf.gz \

—mask hc_tn_genotypegvcfs.vcf.gz \

—mask-extension 5 \

—mask-name hcc1143_hctngenotypegvcf \

-O m2pon_mask_hcc1143_hctngenotypegvcf.vcf.gz

```

You are asking why I use this extension. In this step, I want to (i) remove common germline variant sites that are variant in samples A and B, so that the workflow can call on them, and (ii) retain all other sites in the PoN that represent artifacts and possible contamination. To quote myself in an answer to @JmeAlena,

> I arbitrarily selected a five base mask without any additional sanity checks or validation to tune appropriateness.

Remember that my role with the GATK is to communicate tool features and workflows. The Communications team is not involved in actual research endeavors nor validations of tools nor validations of recommended workflow parameters. This particular workflow is something I developed in a limited time, to show that Mutect2 could be used in a different way than showcased in the official workflow. I hope that researchers will do their own due-diligence towards the parameters they will use.

I use the extension to illustrate that for the sites of interest that are masked, any slight variations in variant call representation are not a factor towards retaining sites in the PoN. To be clear, GATK callers by design left-align variants but this cannot be said to be true for all callers or database resources out there that folks may decide to use as the source of the sites they wish to mask. Also, regions of low-complexity and repeats can be represented differently, based on the read data, even despite left-alignment. So additionally masking areas around AB calls seemed prudent at the time.

From JmeAlena on 2018-04-27

Thank you @shlee

In the past I have compared filtered and unfiltered vcfs to indicate whether it was missed or filtered. I will attempt that again and get back to you. I have followed the best practices for RNAseq but essentially replaced haplotype caller with this workflow.

As for the bamouts, I excluded raw bams for visual simplicity but in some cases the regions between the raw and bamout are drastically different. Is this common?

From Sheila on 2018-05-01

@JmeAlena

Hi,

What do you mean by “drastically different”? It is possible in messy regions for the reads to be realigned and the soft clips are gone in the bamout (because an indel has been uncovered), or the region is cleaned up in general. Also, it is possible some reads are filtered out and not output in the bamout.

-Sheila

From EADG on 2018-05-07

Hi shlee, Sheila and the other people who may still interested in that topic,

I wrote a little not so shiny java-application to create the two site-only-af.vcf, which you need for that workflow. For all the people who don’t want to use excel or the sed/awk solution from @pearlpearl .

In addition, I create an example wdl-file+dockerImage so you can easily use it on firecloud if you want.

Just take a look at https://github.com/Selonka/premusa/

Greets,

EADG

From shlee on 2018-05-07

Wow, @EADG. Thanks for sharing.

From liuxq on 2018-06-21

@shlee , great article. I am still confused about the LOH found by this off-label use. For a specific site, if reference allele is A, normal sample has heterozygous mutation, AB, and tumor sample lost the A allele with only mutant allele B left. Can such kind of LOH be detected by this workflow? thanks.

From Sheila on 2018-06-24

@liuxq

Hi,

It seems like there are plans to implement this in the future. Keep an eye on [this ticket](https://github.com/broadinstitute/gatk/issues/2934) for updates.

-Sheila

From husee on 2018-07-08

I was a liittle confused about the command gatk SelectVariants –V pon.vcf -XL AandB_haplotypecaller.vcf –O pon_maskAB.vcf ,the -XL needs a intervals not a vcf file. In the comment, I found the VariantFiltration , are they the same ?

From Geraldine_VdAuwera on 2018-07-10

@husee -L and -XL can both take either an intervals file or a VCF file. If you use a VCF file the program will use the variant positions as interval locations.

From LJY on 2018-07-29

Hi, thanks for the wonderful article. I wonder whether you have a similar piece for calling the differences (SNPs, Indels) between two samples using RNA-seq? Say I have one cancer cell line and expose it to a drug and RNA seq both the original and after treatment (triplicate samples for each). Would that be wise if I use the recommedation from https://gatkforums.broadinstitute.org/gatk/discussion/3891/calling-variants-in-rnaseq for data clean up and use the work flow in this article? Thank you.

From Sheila on 2018-08-04

@LJY

Hi,

For somatic variant detection, you should follow the workflow described here. The Best Practice doc for RNA-seq you pointed to is for germline variant calling.

-Sheila

From LJY on 2018-08-05

@Sheila

Hi Sheila,

Thanks for the reply. Yes I do understand that the Best Practice doc for RNA-seq is for germline variant calling. However, should I use the data cleaning process (2-pass Star alignment, mark duplicate, sort, split“N“trim, Indel Ralignment and base recalibration) described in that article for RNA-seq data, before using the workflow here for calling “somatic” variants.

Thanks,

From Geraldine_VdAuwera on 2018-08-05

Yes you should, with the caveat that some people like to include a somatic variation database (eg COSMIC) in the BQSR step when looking for somatic variants. We have not done a systematic analysis of the benefits of that practice but the reasoning is sound.

From LJY on 2018-08-05

Thanks Geraldine. Very helpful input as usual.

From mine on 2018-09-27

Hi,

I have RNA-seq breast cancer data. UnfortunateIy, I have not matched samples. But I have some normal samples that are taken from cancerous people. So I have created my PON file by using these samples.

When I give zero to parameters —af-of-alleles-not-in-resource and —max-population-af as mentioned in here, it will give less “PASS” variations than the results obtained by default values of these parameters.

Which one should I use? I am a little bit confused? Should I give zero or use default values of these parameters?

If you can answer and help, I will be very pleased. Thank you.

From kittisakmeaw on 2018-10-19

I have more news than this. But thanks for bringing it to me.:mm

From dario_romagnoli on 2018-11-29

Hello @shlee, can you also explain how was `AandB_haplotypecaller.vcf` generated?

Report abuse