Combining variants from different files into one

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by delangel

on 2012-07-23

Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or gVCF files) per-chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and distinct intervals) which you can read about on the tool's documentation page.
The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many gVCF files. We recommend combining the output gVCF in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs (for performance reasons), which you can do using CombineGVCFs, which is specific for handling gVCF files.
The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same but using different parameters. This produces separate callsets for the same samples, which are then easier to compare if you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, combining annotations as appropriate. This is the case that requires the most preparation and forethought because there are many options that may be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.

Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site in all of your input VCF files, and annotate in which input callsets the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. Any part of the Venn of the N merged VCFs can then be extracted specifically using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

set=Intersection : occurred in both call sets, not filtered out
set=NAME : occurred in the call set NAME only
set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2
set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

You specify the NAME of a callset is by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but you only keep sites that are present in at least 2 of them. To do so, simply add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. In our example, you would add -minN 2 to the command line.

Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf # select the intersection java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

# contents of union.vcf 1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection 1 990882 SNP1-980745 C T . PASS CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni 1 990984 SNP1-980847 G A . PASS CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni 1 992265 SNP1-982128 C T . PASS CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni 1 992819 SNP1-982682 G A . id50 CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll 1 993987 SNP1-983850 T C . PASS CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni 1 994391 rs2488991 G T . PASS AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3 1 996184 SNP1-986047 G A . PASS CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni 1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection 1 999649 SNP1-989512 G A . PASS CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni # contents of intersect.vcf 1 950243 SNP1-940106 A C . PASS AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection 1 957640 rs6657048 C T . PASS AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection 1 959842 rs2710888 C T . PASS AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection 1 977780 rs2710875 C T . PASS AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection 1 985900 SNP1-975763 C T . PASS AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection 1 987200 SNP1-977063 C T . PASS AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection 1 987670 SNP1-977533 T G . PASS AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection 1 990417 rs2465136 T C . PASS AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection 1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection 1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection

Updated on 2015-05-16

From mpvivero on 2013-05-13

I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:

#CHROM POS ID REF ALT QUAL FILTER INFO

chr1 762084 . T C . . .

chr1 762136 . A C . . .

chr1 762189 . A C . . .

chr1 762192 . T C . . .

chr1 762195 . C A . . .

chr1 762196 . T C . . .

As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:

java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF

After, I will be extracting variants that intersect both sets using SelectVariants.

The program completes without error, however, my merged VCF output does not populate with any data beyond the header (ie. no variants are listed). The program only runs for under a second and the completes and shuts off. What could be my issues? Please help. Thanks!!

From Geraldine_VdAuwera on 2013-05-15

What is the output to the console? Is there an error or other message?

From Carneiro on 2013-05-15

are abc.vcf and xyz.vcf well formed? How were they generated?

From mpvivero on 2013-05-15

Hi Geraldine and Carneiro,

Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn’t working. For now, I think I have everything under control after some manual reformatting.

Will get back to you if anything else comes up.

From ecyeh on 2013-05-21

It is a nice tool, thank you.

Suppose I have a vcf file containing the genotype of 5 samples, and want to find variants occurring in at lease 3 of them. To get the “set=” attribute, do I need to split the vcf file into 5 ones first, by using SelectVariant, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?

From Carneiro on 2013-05-21

@ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren’t too many combinations. If you had more samples than that, I’d write a walker to do it.

From ecyeh on 2013-05-23

Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.

From Carneiro on 2013-05-23

fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.

From chongm on 2013-10-29

Hi, I’m combining different samples (each VCF has a “batch” of samples that were processed together and I just wanted to have my variants in one VCF for convenient’s sake – i.e. I’m not combining variants in order to merge samples that were spread out on different runs). I was wondering why the PL changes for each genotype when I’m simply putting all my variants in one place.

From weberATillinoisedu on 2013-10-29

I think that there is a typo in Example 8. “-select ‘set == “;Intersection”;’” should lose that first semi-colon. With the semicolon, I get 0 VCF records produced. When I use “grep” there are 2300+ in my union.vcf data set. After removal, the resulting intersection.vcf file contains the expected number of records. I expect that because “;” is the separator char, it shouldn’t be in the pattern. This was tested using v2.7-2 and v2.7-4.

I hope this helps anyone else who was tripped up by it.

From Geraldine_VdAuwera on 2013-10-29

Hi @chongm,

First, be careful when you merge VCFs containing variants that were called separately. If you’re doing it to compare results of different calling iterations, evaluate concordance etc, that’s perfectly fine. But combining them for convenience is dangerous because if later, you want to filter them, there are some annotations that you shouldn’t use to filter them together, because the values are relative, not absolute, and will not be on the same scale between different sets. I’m not saying you shouldn’t do it, but if you do it, you should be careful.

The PLs changing is a known issue that occurs when combining variants with more than one alternate allele. CombineVariants currently does not handle multiallelic variants well.

From Geraldine_VdAuwera on 2013-10-29

@weberATillinoisedu, you’re absolutely correct. I will fix the typo in the article. Thanks for pointing it out!

From chongm on 2013-10-30

Hi @Geraldine_VdAuwera, okay thanks for the warning. I will filter each VCF separately then. Once all batches are complete though, I’m going to call variants simultaneously for all batches which should result in one VCF. This is the best way to call variants right? Will some less frequent variants be filtered out if I call them using the entire cohort of samples vs. run by run?

From Geraldine_VdAuwera on 2013-10-30

Yes, joint calling followed by VQSR (variant recalibration) is indeed what we recommend. The process will increase your discovery power for difficult variants, but should not negatively affect the calling of rare variants.

From evakoe on 2014-03-20

It would be great if you could extend the `-setKey` argument so that one can not only specify the key, but also the values. If I combine three VCF files, let’s say a.vcf, b.vcf, and c.vcf, the INFO field might look like `set=variant-variant1`. I assume “variant” corresponds to the first file given with `-V` and “variant1” to the second file given with `-V`, but of course to help me still know this in 1 month it would be great if I could specify other strings for “variant” and “variant1”. Or to make thinks easier, you could also just take the file prefixes a, b, and c. Thanks. Eva

From evakoe on 2014-03-20

Actually, looking closer at my new VCF file combined from three VCF files, I now see that the set argument has four different values, even though I merged only three VCF files. These values are variant, variant1, variant2, and variant3.

I used the option `-setKey source` and what I see now are: `source=variant3`, `source=variant2`, `source=variant-variant2`, but no `source=variant1`, `source=variant` or `source=variant-variant1`. Is it possible that there is a bug with the naming of the sets when it comes to the combinations?

Thank you

Eva

From ebanks on 2014-03-20

Hi Eva,

You just need to name your tracks. E.g. “-V:foo first.vcf -V:bar second.vcf”, then the set values will be ‘first’ and ‘second’.

From evakoe on 2014-03-20

Hi Eric,

thank you very much for this info, I did not know about this option. When I name the input VCFs, I really only observe the three values that I specified.

If you want to take a closer look at why there are four different values when you don’t name the inputs I can send you the three input vcfs I used.

From varsha on 2015-10-06

Hi @ebanks, I saw some previous documentation on using genotypeMergeOptions REQUIRE_UNIQUE and UNIQUIFY and also -‐filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -‐filteredAreUncalled for combining variants but I am not sure which of these is ideal for generating PON vcf with only 1 record for each variant (record positions with the same variant in the same location only once). Please let me know where I can find documentation regarding this. Sorry for the trouble, your help is much appreciated. I have looked for the info in these threads

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

http://gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one

http://gatkforums.broadinstitute.org/discussion/4641/build-a-panel-of-normal-for-mutect

From Sheila on 2015-10-12

@varsha

Hi,

Are you trying to merge VCFs with the same sample names? -genotypeMergeOptions REQUIRE_UNIQUE makes sure all the sample names are different in the input VCFs. -genotypeMergeOptions UNIQUIFY adds a suffix to any sample names that are the same in the VCFs so you can distinguish them. Both options will create a single record for any site that exists in your VCFs.

-Sheila

From Geraldine_VdAuwera on 2015-10-13

It doesn’t matter how you choose to merge genotypes because when MuTect uses the PON, it ignores the genotype information and only looks at site-level information.

From varsha on 2015-10-13

Hi Sheila, Geraldine_VdAuwera, Thank you for your help. I did give all different sample names for the normal vcfs to be merged. I ended up using —genotypemergeoption UNSORTED and it seemed to work fine. I am currently using this PON.vcf generated to run MuTect on my unpaired samples using —normal_panel PON.vcf. I am waiting to see if this works without any errors, I will get back with any updates/ issues. Thanks again.

From varsha on 2015-10-21

Hi would you be able to specify the filteredrecordsmergetype to be used for combining the vcfs for PON? I am unable to narrow down the exact syntax from the forum posts. Thank you for your help.

From Geraldine_VdAuwera on 2015-10-21

Just use `—filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED `

From varsha on 2015-10-21

Hi @Geraldine_VdAuwera, I used both —genotypemergeoption UNSORTED —filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED to generate my PON vcf successfully , thanks again!

From everestial007 on 2015-12-10

Carneiro said: >

ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren’t too many combinations. If you had more samples than that, I’d write a walker to do it.

Geraldine_VdAuwera Sheila

I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that are present in at least 3 different samples. The resulting *.vcf file using this trick should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??

But, I want to know how to used select variants to find the variants that are present in at least 3 samples. I have used -select ‘set == “Intersection”;’ to obtain the variants that are in all 6 samples. But, finding the right flag for selecting variants from at least 3 sample has been a problem. I have checked the select variants tool page but don’t find any thing useful.

Thank you in advance,

From everestial007 on 2015-12-11

After I combine variants (or select variants), I want to use the combined variants.vcf for VQSR. If I want to remove any field and/or info that is not useful for VQSR, what/how can I do it? Also, I am not sure what fields should be retained so the vcf is useful for VQSR analyses.

To remove or retain the field of interest I came across some complex commands using awk and/or python. Is there any command in GATK or vcf tools which could be useful for the purpose.

Thank you !

From everestial007 on 2015-12-11

everestial007 said: > >

Carneiro said:

> >

ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it. > >

Geraldine_VdAuwera @Sheila

> I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that are present in at least 3 different samples. The resulting *.vcf file using this trick should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??

> But, I want to know how to used select variants to find the variants that are present in at least 3 samples. I have used -select ‘set == “Intersection”;’ to obtain the variants that are in all 6 samples. But, finding the right flag for selecting variants from at least 3 sample has been a problem. I have checked the select variants tool page but don’t find any thing useful.

> Thank you in advance,

Looks like the above question had some typos which made the question unclear. I am sorry for that. I have reworded the question again:

I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that were present in at least 3 different samples.

java -Xmx4g -jar GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa —variant:varMA605 commonVARiantsMA605PRIORITY.vcf —variant:varMA611 commonVARiantsMA611PRIORITY.vcf —variant:varMA622 commonVARiantsMA622PRIORITY.vcf —variant:varMA625 commonVARiantsMA625PRIORITY.vcf —variant:varMA629 commonVARiantsMA629PRIORITY.vcf —variant:varNcm8 commonVARiantsNcm8PRIORITY.vcf -o unionVARiantsMAYODANmin3.vcf -minN 3

I used this combine variants method as I was not able to use select variants to pull the variants that were present in at least 3 sample. I used the following command. java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V unionVARiantsMAYODAN.vcf -select ‘set "minN 3";' -o commonVARiantsMAYODANminN3.vcf** which did not work. I know that the used of select **'set “minN 3” is not appropriate in here but not sure how I can do it. let me now if there is way.

But, I think the resulting *.vcf file using this trick (combine variants with minN 3) should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??

Thanks in advance !

- Bishwa K.

From Geraldine_VdAuwera on 2015-12-11

@everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

I believe SelectVariants has an option to drop non-essential fields of the vcf — have a look at the tool doc for a complete list.

From everestial007 on 2015-12-18

It is surprising to see that the ouput of the GenotypeGVCFs from several samples is often small compared to the g.vcf (for each sample we used). My several g.vcf files are above 2 gb but the genotypedGVCF from six samples is only 1 gb. Is this normal?

From Geraldine_VdAuwera on 2015-12-18

This is expected, as the gVCFs include data for many sites that are not called variant, whereas by default the output of GGVCFs only contains sites where a variant has been called. Also, the lines corresponding to each genome position from each gVCF get condensed into a single line in the final output.

From everestial007 on 2015-12-18

Thank you @Geraldine_VdAuwera !

From everestial007 on 2015-12-19

Geraldine_VdAuwera said: >

everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

> I believe SelectVariants has an option to drop non-essential fields of the vcf — have a look at the tool doc for a complete list.

Hi @Geraldine_VdAuwera

After looking through the available flags my thought was that -IDs and -excludeIDs are the appropriate flags to select/exclude the field of interest. But, for some reason I am not finding a proper way of doing it. I created a fileKeep document and typed in the strings (as is in the vcf file) for the fields I am interested in. The script runs and outputs a file but with out any information on it except the headers (but there seems to be no selection). So, probably my selection of the flag is right but I am not able to find a right way of using it.

Please let me know if there is somewhere else I should be looking for.

Thanks,

From Geraldine_VdAuwera on 2015-12-19

Ah, my bad, it’s CombineVariants that has a -minimalVCF argument. And there’s a general engine capability called -sites_only that goes even further.

From everestial007 on 2015-12-19

@Geraldine_VdAuwera

I checked and found that -minimalVCF requires a boolean logic. I am not sure what value should I give. Providing true/false didnot help and i found no example to help myself. Also, I found no -sites_only argument for CombineVariants.

Could you please provide me with an example that could be helpful.

Thank you,

From Geraldine_VdAuwera on 2015-12-19

For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

-sites_only is an engine capability so it’s documented with the rest of the engine arguments, not with any particular tool’s.

From everestial007 on 2015-12-22

> @Geraldine_VdAuwera said:

> For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

> -sites_only is an engine capability so it’s documented with the rest of the engine arguments, not with any particular tool’s.

Thank you !!!

From mglclinical on 2016-01-11

nice post

From Guillefriis on 2016-04-11

Hi,

I'm trying to intersect a GATK called vcf file with other vcf called with a different genotyper (I've tried with samtools and TASSEL so far) to use the intersect as reliable SNP dataset to perform the BQSR since I work with a non-model avian genus and there is no other datasets available. I'm having a lot of problems, like incoherences with respect the reference genoe (even when I've used exactely the same one) or related with lacking the the vcf.idx (##### ERROR MESSAGE: Problem detecting index type) at the first step, when using th CombineVariants tool. Has somebody tried this before? Suggestions? Thanksd a lot.

Guillermo

From Geraldine_VdAuwera on 2016-04-11

@Guillefriis You’ll need to post the details of each problem separately, as we can’t give you a one-size-fits-all solution.

From Guillefriis on 2016-04-12

Hi @Geraldine_VdAuwera

Let's leave aside the intersection with the TASSEL calling dataset.

I performed a SNPcalling following the GATK best practices workflow till the BQSR step using Genotyping-by-sequencing (GBS) data for around four hundred individuals. Since I work with emberizids, I used the Zebra Finch genome as reference. I reproduced the GATK SNPcalling with strict quality thresholds and did an alternative SNPcalling using the mpileup tool from SAMTOOLS in order to recover those SNPs detected by both workflows and use them in the BQSR, assuming they would be more reliable. I'm using the script detailed here:

" module load GATK/3.3-0

gatk=/gpfs/resapps/GATK/3.3-0/GenomeAnalysisTK.jar genome=~/data/Borja/finchgenome/ZEFIgatkindex/Taeniopygiaguttata.taeGut3.2.4.dna.toplevel.fa samtoolssnps=~/scratch/PIPELINE/GBS/FinalWorkflow/Tsamtoolscalling/out_strict.vcf

java -jar $gatk \ -T CombineVariants \ -R $genome \ -V JUNCOsnpsZEFIHQintersectBQSR.vcf \ -V $samtoolssnps \ -o tassel_gatk.vcf

java -jar $gatk \ -T SelectVariants \ -V tasselgatk.vcf \ -R $genome \ -o IntersectZEFI_BQSR.vcf \ -nt 16 \ -select 'set == "Intersection";' "

However the CombineVariants tool gives an error somehow related with the index, my guess the lacking samtools snps vcf index:

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.3-0-g37228af):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Problem detecting index type

ERROR ------------------------------------------------------------------------------------------

I couldn't find a way to produce and index for the file. I also checked related posts (http://gatkforums.broadinstitute.org/gatk/discussion/2283/in-regards-to-intersecting-vcf-files) but couldn't find a solution. Maybe is not doable? Thanks for your attention Geraldine.

Cheers Guillermo

From Sheila on 2016-04-12

@Guillefriis

Hi Guillermo,

I suspect the issue is with the non-GATK generated VCF. Can you try running [ValidateVariants](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php) on your VCFs? If they pass, you can try deleting the VCF indices and GATK will regenerate them for you.

-Sheila

From Sumudu on 2017-01-03

Hi,

I have a multi-sample (6 samples) vcf file and want to select sites present in at least 3 of them. Someone has asked the same question previously and the answer was to use SelectVariants. But I can’t still figure out how to use it. Should I use VariantContext or a script to do that. Appreciate if you would clarify it a bit more.

Thanks

-Sumu

From everestial007 on 2017-01-03

@Sumudu:

I have used `CombineVariants` to select variants present in at least 3 samples like this:

`java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa —variant:varSNPsMA605 filteredHCVariantsSNPsMA605.vcf —variant:varSNPsMA611 filteredHCVariantsSNPsMA611.vcf —variant:varSNPsMA622 filteredHCVariantsSNPsMA622.vcf —variant:varSNPsMA625 filteredHCVariantsSNPsMA625.vcf —variant:varSNPsMA629 filteredHCVariantsSNPsMA629.vcf —variant:varSNPsNcm8 filteredHCVariantsSNPsNcm8.vcf -o unionSNPsvariantMAYODANmin3.vcf -minN 3`.

I think the `-minN` 3 is an universal walker so just try it with `SelectVariants`. If not you will have to separate the samples and then combine it.

From Sumudu on 2017-01-04

Thanks @everestial007 . I have a multi sample vcf and I’m looking for a more convenient way of selecting variants without separating my samples in to individual vcfs. If any other suggestion would be appreciated.

Sumudu

From everestial007 on 2017-01-04

@Sumudu : I am not sure if you found the solution yet. But, I tested `—maxNOCALLnumber` with `SelectVariants` and it worked well. Unforutnately, `-minN` isn’t a universal walker and doesn’t work with `SelectVariants`.

Since, you want sites called at least in `3 samples` you can use it in following format:

`java -Xmx4g -jar /home/everestial007/G

enomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.vcf —maxNOCALLnumber 3 -o MY.phased_variants.minN3.vcf`

`—maxNOCALLnumber 3` – this option tells the walker to select the sites from the input `*.vcf` where maximum of 3 samples don’t have any genotype called. You can also use the fraction using `—maxNOCALLfraction`

Also, make sure to read all the possible option on the GATK tool for any task `-T` and try it out on small vcf samples.

From Sumudu on 2017-01-05

Thanks @everestial007 .

From Geraldine_VdAuwera on 2017-01-07

The `maxNOCALLnumber` approach will probably not work on a proper joint-called multisample VCF since any samples that are not variant but have enough data present at the sites of interest will have hom-ref calls. There are some decently simple ways to retrieve variants present at a certain frequency in the population, but embarrassingly we don’t have a straightforward way to simply select variants based on how many samples are variant or not variant. Which is not to say it can be done; there are several circuitous ways you could do it, including doing a first pass to filter out hom-ref genotypes (using VariantFiltration and the dreaded variant context), then a second to set these to no-calls (still with VariantFiltration, then finally use the `maxNOCALLnumber` trick proposed by @everestial007.

My recommendation? Use VariantsToTable to output a table of your variants with contig, position and genotypes, and use your scripting language of choice (heck, even Excel) to determine which you want to keep. Then simply produce an intervals list using the contig and position, and use that as `-L` value in SelectVariants. It’s cleaner and while you’re at it you can collect some preliminary stats on what your filtering is going to produce.

From Sumudu on 2017-01-09

@Geraldine_VdAuwera

Thank you for the detailed clear answer. Actually, I used SnpSift tool using the expression “countRef() < 4” to select variants present in at least 3 samples (Het or Homo Var) out of my 6 samples, in order to obtain a consistent variant set. I hope this method is also ok.

Sumudu.

From everestial007 on 2017-01-09

Geraldine_VdAuwera Sumudu

HI Geraldine,

Rather than using first pass method combined with other process, I think it would help – splitting the vcf file into multiple files by samples and then combining with them using `-minN` option.

Thanks,

From Sumudu on 2017-01-10

@everestial007,

Thanks. But what if you have >200 samples or so? Splitting them in to separate samples will be a tedious task.

Sumudu.

From everestial007 on 2017-01-10

@Sumudu

You have to realize that there are no perfect tools that do all. If you are not from informatics background, I sugggest you start looking into other forums and tools, which might have easier solution than GATK.

Solutions

#1:

For, splitting you vcf samples by samples check this:

https://www.biostars.org/p/224702/ – look for the answer by `WouterDeCoster` Using GNU parallels. Other answers are `python` based (I think).

After this you can `CombineVariants` from GATK with `-minN` option. Don’t use `uniquify` or `prioritize`.

I think with lots of sample you can create a file containing `list of vcf` and submit it in the GATK script, but I am not sure about it Geraldine_VdAuwera Sheila may be able to confirm.

#2:

Another useful tool would be

`VCFtools` http://vcftools.sourceforge.net/man_latest.html

Reading through the description, I think SITE FILTERING OPTIONS > Genotype value filtering > (—maxmissing or —maxmissingCount) would probably work.

#3:

Or, `BCFtools` https://samtools.github.io/bcftools/bcftools.html

Look into `bcftools isec` option

I have not personally verified the use of `vcftools` `bcftools` but they are pretty good tools for variants filtering. Another issue that you need to be careful is to look at the vcf file (or the regions within it) after filtering visually using `Vim Editior` on linux or `IGV`.

Also, you may apply all three methods: `CombineVariants` with GATK, `vcftools`, `bcftools` and see how well each of these tool/method has worked. It would be nice if you could let us know about your experience !

Thanks,

From Sumudu on 2017-01-11

@ everestial007,

Thanks for your answer. Let me go through what you have suggested and will let you know about my findings.

Sumudu.

From shlee on 2017-02-06

Hi everestial007 and Sumudu,

For many tool arguments, in GATK3 you can use the `-args`, aka `—arg_file ` option within the command. The arguments must be on the same line, e.g. `-V file1.vcf -V file2.vcf … -V fileN.vcf`. I just happened to test an arg_file last week with CombineVariants and it works well.

In GATK4, you will be able to provide a file listing the paths to each input file per line, e.g. `-V one-per-line-directory-paths.txt`.

I hope this is helpful.

From zejunyan on 2017-07-21

@shlee

Hi, thank you for you reply on population prior.

I got a new question:

I am trying to combine my callset with dbSNP in vcf format (public data from NCBI) using CombineVariants. two questions:

first, when combining, I have to make sure these callsets were produced based on the same reference genome, right? The problem is that these two callsets were produced using different version of chicken ref genomes, mine used version-5, the dbSNP used version-4.

Second, if I used the same ref genome as the dbSNP, the problem is that the #CHROM column in my callset does not match with that in dbSNP. For example, chromosome 22 in my callset is named ‘NC_006109.4’, but it is named as ‘22’ in the dbSNP.vcf. I guess this does matter, and it does not allow me to combine two vcf files. Right?

The solution to this could be that if I can change NC_006109.4 to 22 in my callset vcf file. Do you know a good approach to do this, possible?

Cheers

Dr yan

From shlee on 2017-07-21

Hi Dr. Yan (@zejunyan),

If the reference assembly coordinates shifted for the chicken genome between v4 and v5, then for the dbSNP resource file you can use something we call liftover. Lifting over coordinates is not an ideal thing to do but given the cost of regenerating the resource file (by realigning and calling on the new reference), we make do with liftover. Liftover requires a chain file and you can obtain these, e.g. from UCSC (Genome Browser), NCBI/GRC and ENSEMBLE. UCSC also offers a web-based liftOver application that is quite popular. Picard also offers LiftoverVcf as a tool. I suggest you compare the liftover results using different chains/functions and decide which is best for you. A month ago I actually inquired about acceptable rates of rejection and was told by an expert in the field that ~1% is the cutoff.

Lifting over will resolve differences in contig nomenclature. For GATK and Picard tools, contig names must match exactly.

Dr. Lee

From zejunyan on 2017-07-31

@shlee

Thank you very much!

I will give a try

Dr yan

From zejunyan on 2017-07-31

@shlee

Hi, shlee

I am trying this liftover with commandline: java -jar ../tools/picard-tools-1.141/picard.jar LiftoverVcf I=AVSEQ1300078.variants.g.vcf.gz O=lifted_over.vcf CHAIN=galGal4ToGalGal5.over.chain.gz REJECT=rejected_variants.vcf R=/mnt/Pella/Processing/NGS-working/zyan/galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa

However, I did not see the chromosome coordinates in galgal4 changed to that in galgal5. In fact, all the records in my input vcf (ASEQ1300078.variants.g.vcf) are rejected into rejected.vcf, The lifted_over.vcf is empty.

I am nor sure where I input the cut-off value for the rate of rejection?

Cheers

Dr yan

From shlee on 2017-07-31

Hi @zejunyan,

It looks like you are using an older version of Picard. We are on [Picard v2.10.5](https://github.com/broadinstitute/picard/releases) currently. See how the latest version handles your data and also see how UCSC’s liftOver handles your data. The latter allows a cut-off value for the rate of rejection. Let me know how these runs go.

From zejunyan on 2017-07-31

@shlee

Hi, shlee,

When I tried to do BQSR using the snp database for chicken as the input for the knownSite, it always gave me this error message. Is there a way to fix this problem throughout this input vcf file?

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 5466413: The VCF specification does not allow for whitespace in the INending field value was “RSPOS=67905532;GENEINFO=101749622:SERPINB10 CPOX|420895:SERPINB6;dbSNPBuildID=138;SAO=0;VC=snp;VLD;VP=0500000C0005000000000100”, foe: /gensys/projects/NGS/zyan/snp_db/00-All.sorted.vcf.gz**

Bests

Dr Yan

From shlee on 2017-07-31

@zejunyan,

Sounds like the known sites data file has some sort of formatting issue.

> ERROR MESSAGE: The provided VCF file is malformed at approximately line number 5466413

Look into the file to see what is causing the error and if this type of formatting issue is systematic. Then see if you can correct the formatting issue.

From zejunyan on 2017-09-06

@shlee

Hi, Shlee

I got a question about combining two variant sets: one is my callset, the other is the dbsnp from ensemble. Mt callset contains a total of 9,129,383 snps, the dbsnp database contains 20,829,354 snps. After combining, i had 22,135,071 snps instead of 29,958,737 (= 9,129,383 + 20,829,354). So, there is 7,823,666 snps missing in the combined VCF file.

Why is this?

Bests

Dr yan

From zejunyan on 2017-09-07

@shlee

I got another question:

I do hard filtering on my callsets, and I used these filters: QD, MQ, SOR, FS, MQRankSum and ReadPosRankSum, the threshold value for each filter was determined by density plots.

when I applied these filters one at a time, i.e filtered for 6 times, the number of snps in my callset was reduced from 1,317,911 to 612514. In each round of filtering, I extracted the snps with PASS, which will be used input for the following filtering.

However, when I applied all these filters in one goal using &&, there was no snps filtered out at all.

I think I should get the same number of snps through both ways.

So, why is this?

Bests

Dr yan

From Sheila on 2017-09-07

@zejunyan

Hi Dr. Yan,

The “missing” variants are probably the ones that are the same in both files. For example, I could have 5 variants in VCF1 and 5 variants in VCF2. If 3 variants are the same in both files, my combined VCF would have 7 variants instead of 10 variants. Does that make sense? In your case, the 7,823,666 variants are the same in both VCFs.

I think if you use || (or) instead of && (and) in your command, you will get the results you expect.

-Sheila

From zejunyan on 2017-09-08

@Sheila

Thank you very much!

Yes, this does make sense.

I got another question about Genotype Refinement Pipeline:

What I am trying to do is to find DeNovo mutations in my trio (11 samples in this trio) using this pipeline. So, my input VCF file for calculating genotype posteriors is the intersected SNPs (between my gatk callsets and dbsnp from ensemble), the following is my command line for this step:

[java -jar ../../../tools/GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R ../../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -V IntersectionSNP.vcf -ped ../trio.ped —supporting /gensys/projects/NGS/zyan/snp_db/ensemble_version/new_version.Gallus_gallus.vcf.gz -o output.vcf]

my .ped file looks like this:

FamilyID IndividualID PaternalID MaternalID Sex Phenotype

FAM001 2028N0019 0 0 other -9

FAM001 2028N0020 0 0 other -9

FAM001 2028N0021 0 0 other -9

FAM001 2028N0022 0 0 other -9

FAM001 2028N0023 0 0 other -9

FAM001 2028N0024 0 0 other -9

FAM001 2028N0026 0 0 other -9

FAM001 AVSEQ1300044 0 0 other -9

FAM001 AVSEQ1300045 0 0 other -9

FAM001 AVSEQ1300046 0 0 other -9

FAM001 AVSEQ1300047 0 0 other -9

for supporting file, I used the dbsnp from ensemble.

After calculation of genotype posterior:

I filtered lowGQ (<20) and did variant annotation as follow:

java -jar ../../../tools/GenomeAnalysisTK.jar -T VariantFiltration -R ../../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -V output.vcf —filterExpression “GQ < 20” —filterName “lowGQ” -o GQ_filter.vcf

java -jar ../../../tools/GenomeAnalysisTK.jar -T VariantAnnotator -R ../../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -V GQ_filter.vcf -A PossibleDeNovo -o GQ_filter.vcf

HOWEVER, I got no snps marked as DeNovo mutations.

I looked back at the GQ-filtered VCF file, for some variants, there are still GQ values in some samples of my trio having GQ < 20, although these variants are marked “PASS”. Am I supposed to have all samples having GQ >=20 after GQ filtering? Then, after annotating, I did not see any variants marked DeNovo mutations.

Why is this? Is there something wrong with my ped file or supporting file?

Bests

Thank you so much for your help

Dr yan

From Sheila on 2017-09-15

@zejunyan

Hi Dr. Yan,

What happens if you simply input your original VCF from the 11 samples instead of the intersection of dbSNP and your original VCF?

Thanks,

Sheila

From zejunyan on 2017-09-19

Hi, @Sheila

In the documentation of the VariantAnnotation step of Genotype Refinement Workflow, for High confidence de novo sites, it says that all trio sample GQs >=20 with the same AC/AF criterion,

What do you mean by ‘the same AC/AF criterion’? Do you mean that these High-confidence SNPs should also have AC<4 or AF < 0.1%, which is the same as the low-confidence SNPs?? Or, for High-confidence SNPs, they should have AC > 4???

What do you mean by AC < max(4, 0.1% sample)???

Could you tell me what is actually AC and how it is calculated in trio samples?? What is the difference between the AC within a single genotyped sample and the AC in genotyped trio samples??

Bests

Dr Yan

From Kumar on 2017-09-20

Hi, I have 13 samples, 6 belongs to group1 and other 6 represents group2 and one sample is an outlier. I called variants using Haplotype caller > gVCFs > GenotypedGVCFs. I then merged all the samples together using combineVariants tool. Now I have a VCF file with all samples in one file. I then wanted to fetch all the variants unique to group1. So I used SelectVariant tool with the expression set=“sample1-sample2-sample3-sample4-sample5-sample6”. This execution did not give me any variants in the resulting file. I am wondering if the set expression is wrong or there is some other issue?

Any suggestions? How can I check if the VCF file is malformed? its looking OK.

ks575@minerva:~$ java -Xmx10g -jar /home2/ISAD/ks575/software/gatk/GenomeAnalysisTK.jar -R ../../../G006b_filled.fa -T SelectVariants -V:variant Sus_Res.gt.snp.filter.vcf -select ‘set==“LIB15045-LIB18337-LIB18338-LIB18339-LIB18340-LIB18341”’ -o Sus_specific_variant.vcf

INFO 16:50:39,368 HelpFormatter – ————————————————————————————————————————

INFO 16:50:39,373 HelpFormatter – The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29

INFO 16:50:39,375 HelpFormatter – For support and documentation go to https://www.broadinstitute.org/gatk

INFO 16:50:39,375 HelpFormatter – [Wed Sep 20 16:50:39 BST 2017] Executing on Linux 3.13.0-119-generic amd64

INFO 16:50:39,376 HelpFormatter – Java HotSpot™ 64-Bit Server VM 1.8.0_91-b14 JdkDeflater

INFO 16:50:39,385 HelpFormatter – Program Args: -R ../../../G006b_filled.fa -T SelectVariants -V:variant Sus_Res.gt.snp.filter.vcf -select set==“LIB15045-LIB18337-LIB18338-LIB18339-LIB18340-LIB18341” -o Sus_specific_variant.vcf

INFO 16:50:39,394 HelpFormatter – Executing as ks575@minerva on Linux 3.13.0-119-generic amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_91-b14.

INFO 16:50:39,395 HelpFormatter – Date/Time: 2017/09/20 16:50:39

INFO 16:50:39,396 HelpFormatter – ————————————————————————————————————————

INFO 16:50:39,435 GenomeAnalysisEngine – Strictness is SILENT

INFO 16:50:43,304 GenomeAnalysisEngine – Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

INFO 16:50:47,539 GenomeAnalysisEngine – Preparing for traversal

INFO 16:50:47,548 GenomeAnalysisEngine – Done preparing for traversal

INFO 16:50:47,549 ProgressMeter – [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 16:51:17,561 ProgressMeter – scaffold_445:243628 1235714.0 30.0 s 24.0 s 71.0% 42.0 s 12.0 s

INFO 16:51:30,274 SelectVariants – 1917799 records processed.

INFO 16:51:30,320 ProgressMeter – done 1917799.0 42.0 s 22.0 s 100.0% 42.0 s 0.0 s

INFO 16:51:30,321 ProgressMeter – Total runtime 42.77 secs, 0.71 min, 0.01 hours

—————————————————————————————————————————————

Done. There were no warn messages.

—————————————————————————————————————————————

From Sheila on 2017-09-22

@zejunyan

Hi Dr. Yan,

The same AC/AF criterion means AC < 4 or AF < 0.1%. The low confidence and high confidence de novos both have the same AC/AF criterion.

The criteria basically ensures the variant allele is not common in the population/family. That is also the reason we take “whichever is more stringent for the number of samples in the dataset”. You can have a look at the VCF header for more information on what AC and AN are. They are two ways of showing how common the variant allele is in the VCF.

-Sheila

From Sheila on 2017-09-25

@Kumar

Hi,

So, did you create a final VCF for each sample? The best thing to do is run GenotypeGVCFs on all the GVCFs together. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article.php?id=4150).

To determine the variants unique to each set/sample, the best thing to do is use JEXL and for loops. Have a look at [this document](https://gatkforums.broadinstitute.org/gatk/discussion/1255/using-jexl-to-apply-hard-filters-or-select-variants-based-on-annotation-values) for more information (specifically, under Using VariantContext directly).

-Sheila

P.S. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/8794/selecting-intersecting-variants) and [this article](https://gatkforums.broadinstitute.org/gatk/discussion/54/selecting-variants-of-interest-from-a-callset). If you search the forum, you should find many other relevant threads too.

From Kumar on 2017-09-26

@Sheila

Thanks for the reply.

Yes I created the final VCF by running GenotypeGVCFs (group1 with all merged gVCFs and separately on group2 with all merged remaining samples). I tried bcftools isec function to get the unique set of SNPs from each individual groups. I think it worked.

I have not tried the one you have mentioned above (jEXL and for loops). I will try that now. Thanks for the links.

From prepagam on 2018-02-21

I’m called variants in 10 samples, using single sample calling (i.e. called seperately in each animal) from a gvcf with haplotype caller, and emited both variant and invariant sites. I then filtered this file, by triming the alternates, and selecting AN=2.

I’m now trying to merge those 10 filtered vcfs into a single file.

I am getting this error:

ERROR MESSAGE: CombineVariants should not be used to merge gVCFs produced by the HaplotypeCaller; use CombineGVCFs instead

I am confused because I am combining 10 vcfs (and i checked and they are vcfs).

java -jar -Xmx4G ${GATK} \

-T CombineVariants \

-R ${REFERENCE} \

-V MYINFILES \

-o OUTFILE

This is gatk37, I tried gatk38, that didn’t help.

Its important the merged file has both variant and nonvariant sites, and I have to call seperately.

From Sheila on 2018-02-27

@prepagam

Hi,

I think this may be a case of CombineVariants not being able to handle non-variant sites. Is there a reason you cannot use GenotypeGVCFs on all GVCFs?

-Sheila

Report abuse