howto Run the genotype refinement workflow

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2014-10-18

Overview

This tutorial describes step-by-step instruction for applying the Genotype Refinement workflow (described in this method article) to your data.

Step 1: Derive posterior probabilities of genotypes

In this first step, we are deriving the posteriors of genotype calls in our callset, recalibratedVariants.vcf, which just came out of the VQSR filtering step; it contains among other samples a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we are using the most comprehensive set of high confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes project (available in our resource bundle), which we pass via the --supporting argument.

java -jar GenomeAnalysisToolkit.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf

This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible.

Step 2: Filter low quality genotypes

In this second, very simple step, we are tagging low quality genotypes so we know not to use them in our downstream analyses. We use Q20 as threshold for quality, which means that any passing genotype has a 99% chance of being correct.

java -jar $GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf

Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ with the FT tag of the FORMAT field.

Step 3: Annotate possible de novo mutations

In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members.

java -jar $GATKjar -T VariantAnnotator -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf

The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.

See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.

Updated on 2015-11-26

From Gustav on 2014-10-28

Hello,

I have been unable to locate “ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf “ on 1000G site and Broads ftp. Where can I actually find or how can it be generated?

Thank you.

From Geraldine_VdAuwera on 2014-10-28

Hi @Gustav,

We don’t currently provide that file, but we are planning to include it in our resource bundle in the near future. In the meantime, you can generate it from the data that is publicly available from the 1000Genomes project website.

From estif74 on 2014-11-16

Hi there Geraldine, had 2 comments/questions about this:

1. In step 2 of the above, should the variant file (-V) as an input be the same as the output file (-o) of step 1? The input file in the second step is listed as “C1643.PbyT.CGP.vcf” but the output file of the first step is listed as “recalibratedVariants.postCGP.vcf”

2. Was just wondering the rationale for filtering on GQ < 20 in step 2? If you’ve passed your VCF through VQSR, would it be OK to just filter (using SelectVariants) for those that passed the filter (-ef)?

Thanks for your help as always!

From Geraldine_VdAuwera on 2014-11-17

Hi @estif74,

1. I think that’s a copy/paste error; I’ll check and fix if it is.

2. VQSR only tells you which variant sites are ok; whereas filtering on GQ tells you, within the pool of good sites, which sample genotypes are potentially unreliable. You can have a good variant, where you know you have a sample that’s not hom-ref, but where you don’t know if the sample is het or hom-var. Good variant, bad genotype. Two different levels of filtering.

From estif74 on 2014-11-21

Got it, that’s very helpful. Thanks for the explanation as always!

From albertyu on 2015-01-08

Hello, I am wondering why you use 1000 genome phase 3 SNP vcf files in step 1? And why do you use 1000G_phase1.snps.high_confidence.b37.vcf for VariantRecalibrator in the Best practice?

After I download vcf files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ (ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz) and merged them into one file, the size of this merged file is really huge. Are these files the correct files I should use for step 1?

Thank you.

From Sheila on 2015-01-12

@albertyu‌

Hi,

The only reason the phases are different between the two files are because there were different phases when each of the documents were written. This document for genotype refinement is newer than the vqsr document.

The files you are using from the 1000genomes ftp have the genotypes included, but you can just use the sites_only files in our bundle which are much smaller. You do not need the genotypes since VQSR does not use them.

-Sheila

From terestahl on 2015-01-17

Hello

After reading these comments it is still not clear to me which file could I use for step 1, to “Derive posterior probabilities of genotypes”. Could this file be OK:

ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf

(from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ )

Otherwise could you please clarify how to prepare the required file

Thank you!

From simonsanchezj on 2015-01-19

Dear Geraldine, I have exome data for ~300 unrelated individuals. I have called variants following GATK’s best practices and I am wondering whether genotype refinement is necessary and if so, whether I need to calculate genotype posteriors using any other database. Thanks a lot!

From Geraldine_VdAuwera on 2015-01-19

@terestahl That looks ok based on the name; though you may need to subset just biallelic snps from that file.

@simonsanchezj‌ Genotype refinement is not required for variant discovery, but it may be useful if your research project involves using genotype information (e.g. for de novo mutation discovery).

From terestahl on 2015-01-20

Hello Geraldine, Thank you for your answer! I extracted the biallelic SNPs (from the file ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf) and passed it with —supporting to derive posterior probabilities of genotypes. After that I filtered low quality genotypes (-G_filter “GQ < 20.0”). The output file is annotated with lowGQ or PASS. However not all variants get this annotation. I then wonder if the file looks OK. Thanks a lot. /Teresa

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 101 P1148_102 ….. 1 65745 . A G 92.55 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:FT:GQ:PL:PP 0/0:5,0:5:lowGQ:7:0,0,112:0,7,131 0/0:4,0:4:lowGQ:19:0,12,147:0,19,166 […..] 1 69511 rs75062661 A G 47644.66 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:GQ:PGT:PID:PL:PP 1/1:0,75:75:99:.:.:2157,225,0:2187,239,0 ./.:1,0:1 […..] 1 721290 rs12565286 G C 970.96 PASS […..] GT:AD:DP:GQ:PL:PP 0/1:8,3:11:65:76,0,263:65,0,280 0/0:12,0:12:44:0,33,495:0,44,523 […..] 1 721450 . G A 1036.89 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:GQ:PL:PP 0/1:100,17:117:99:180,0,3000:172,0,3014 0/0:35,0:35:99:0,99,1485:0,107,1507 […..] 1 752894 rs3131971 T C 10644.86 PASS […..] GT:AD:DP:FT:GQ:PL:PP 0/1:5,5:10:PASS:99:138,0,157:146,0,155 1/1:0,11:11:PASS:35:348,33,0:358,35,0 […..] 1 752928 . C T 50.23 PASS […..] GT:AD:DP:GQ:PL:PP 0/0:16,0:16:77:0,48,583:0,77,646 0/0:13,0:13:68:0,39,473:0,68,536 […..] 1 762109 . C T 4807.89 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:GQ:PL:PP 0/1:251,52:303:99:842,0,7599:834,0,7613 0/0:45,0:45:99:0,118,1771:0,126,1793 […..] 1 762273 rs3115849 G A 133553.81 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:GQ:PL:PP 0/1:95,121:216:99:3583,0,2621:3590,0,2620 1/1:0,143:143:99:5091,430,0:5100,431,0 […..] 1 762366 . G C 612.23 PASS […..] GT:AD:DP:GQ:PL:PP 0/0:45,0:45:99:0,108,1716:0,137,1779 0/0:36,0:36:99:0,99,1485:0,128,1548 […..] 1 762472 rs145493205 C T 1584.69 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/0:51,0:51:lowGQ:7:.:.:0,2,1318:0,7,1334 0/0:36,0:36:PASS:99:.:.:0,99,1485:0,104,1501 […..] 1 762485 rs148989274 C A 1516.82 PASS […..] GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/0:48,0:48:PASS:34:.:.:0,30,1508:0,34,1522 0/0:36,0:36:PASS:99:.:.:0,99,1485:0,103,1499 […..] 1 762589 rs71507461 G C 22684.91 VQSRTrancheSNP99.90to100.00 […..] GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/1:5,22:27:PASS:99:0|1:762589_G_C:888,0,250:896,0,248 1/1:0,19:19:PASS:59:1|1:762589_G_C:822,57,0:832,59,0 […..] 1 762592 rs71507462 C G 22490.93 PASS […..] GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/1:5,21:26:PASS:99:0|1:762589_G_C:888,0,250:896,0,248 1/1:0,17:17:PASS:59:1|1:762589_G_C:822,57,0:832,59,0 […..]

From simonsanchezj on 2015-01-21

Dear Geraldine, thanks a lot for your reply. For this project I have exomed ~300 unrelated individuals for a case-control study. Thus, I don’t intend to perform de novo mutation discovery. Is there anyway I can benefit from the genotype refinement workflow? Worth it? If so, which collection of VCFs should I use for informing allele frequency priors?

Finally, in case i decide not to do this, whould you still recommend to filter low quality genotypes as suggested in Step 2?

Thanks a lot for your help.

From Geraldine_VdAuwera on 2015-01-21

@terestahl I think I answered someone else with the exact same question recently — in a nutshell, not all genotypes are “eligible” to be evaluated by the GQ filter. Some may be skipped if they do not fulfill the necessary conditions, and therefore will not get the FT annotation.

From Sheila on 2015-01-21

@simonsanchezj‌

Hi,

Unfortunately, we cannot decide whether the genotype refinement workflow is appropriate for your analysis or not. It depends if you care at all about being able to distinguish hets from hom-vars. There are other reasons for caring about this besides looking for de novos, and it is up to you to decide. If you only care about identifying variant sites, VQSR should be sufficient.

Good luck.

-Sheila

From simonsanchezj on 2015-01-21

Thanks for your reply, Sheila. I do care about distinguishing hets from hom-vars. I think I understand now how the refinement workflow works. Which collection of VCFs should I use for informing allele frequency priors?

On a different matter, I want to calculate genotype posteriors in some families. I have read somewhere in this forum that the input .ped file should only contain trios. Thus, for each child, I created a dummy family containing both parents and that person. However, when I run CalculateGenotypePosteriors, I get the following error: No PED file passed or no non-skipped trios found in PED file. Skipping family priors.

What am I doing wrong?

Mi command line looks like:

java -Xmx”$MEM“g -jar “$GATK” \

-R “$REFERENCE” \

-T CalculateGenotypePosteriors \

-V “$INPUT” \

—skipPopulationPriors

-ped “$PED“

-o “$OUTPUT”

My ‘dummy’ ped looks like

FAM1 sample1 founder1 founder2 2 2

FAM1 founder1 0 0 1 1

FAM1 founder2 0 0 2 1

FAM2 sample2 founder1 founder2 2 2

FAM2 founder1 0 0 1 1

FAM2 founder2 0 0 2 1

FAM3 sample3 founder1 founder2 1 1

FAM3 founder1 0 0 1 1

FAM3 founder2 0 0 2 1

From Geraldine_VdAuwera on 2015-01-26

What do you mean by “created a dummy family containing both parents and that person”? Are they part of the same family or not?

From simonsanchezj on 2015-01-27

Hi Geraldine,

Sheila already answered the ped-related question in another post. Thank you both for your help.

As for population-based analyses, which collection of VCFs should I use for informing allele frequency priors?

Thanks

From Sheila on 2015-01-28

@simonsanchezj‌

Hi,

You can use the Phase 3 1000 Genomes biallelic SNPs. Please see Geraldine’s response to Gustav in this thread.

-Sheila

From simonsanchezj on 2015-02-02

Dear Sheila, thanks a lot for your reply. Are you planning to include ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf in the 2.8 bundle in the near future? Otherwise, would you recommend using vcf’s contained here? ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/

Thanks a lot for your collaboration in this and other matters.

From Geraldine_VdAuwera on 2015-02-02

We are planning to release a new version of the data bundle with the next version (3.4). It will probably be a few more weeks before we can talk about release schedule though.

From yoskeri on 2015-03-28

Hi, I also have trouble in finding supporting file for CalculateGenotypePosteriors walker. I’ve found biallelic_snp.vcf file at 1000genome ftp server ;

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/site_assessment/ Here, for example

ALL.chr1.unfiltered_union_sites_with_svm.20130502.biallelic_snps.sites.vcf.gz is available. But I have noticed that this directory doesn’t have biallelic_snp.vcf file for chrY.

It involves biallelic_snp.vcf file for other chromosomes.

Are these files adequate “supporting” files ?

From Geraldine_VdAuwera on 2015-03-30

Yes, that sounds right. Not sure why chrY is not represented.

From yoskeri on 2015-04-01

> @Geraldine_VdAuwera

Thank you for your answer. I’ll try merging these files and if in trouble again, post here again.

From jh7521 on 2015-06-19

> @simonsanchezj said:

> Hi Geraldine,

> Sheila already answered the ped-related question in another post. Thank you both for your help.

> As for population-based analyses, which collection of VCFs should I use for informing allele frequency priors?

> Thanks

Hi~

Can you tell me about Sheila’s answer for the ped-related question?

That’s what I want to know.

Many thanks in advance.

From Sheila on 2015-07-01

@jh7521

Hi,

I believe @simonsanchezj is referring to this thread: http://gatkforums.broadinstitute.org/discussion/5068/ped-file-structure-for-calculategenotypeposteriors-walker

I hope this helps.

-Sheila

From rose_ismet on 2015-07-08

Hi,

I've been trying to prepare the ALL.phase3.20130502.biallelicsnps.integrated.sites.vcf file based on the recommendations given in this thread in order to get Step 1 running. After downloading ALL.wgs.phase3shapeit2mvncallintegrated_v5a.20130502.sites.vcf from the 1000G ftp site, I believe my next step is to select only the biallelic site. I tried this out using the SleectVariants walker. However I receive an error as below:

promise@eden:/media/GenomeAnalysisTK-3.4-0$ java -jar GenomeAnalysisTK.jar -T SelectVariants -R /media/GATK\ Bundle/GRCh37/humang1kv37.fasta -V /media/GATK\ Bundle/ALL.phase3.20130502.biallelicsnps.integrated.sites/1000Gftp/ALL.wgs.phase3shapeit2mvncallintegratedv5a.20130502.sites.vcf -o /media/GATK\ Bundle/ALL.phase3.20130502.biallelicsnps.integrated.sites/ALL.phase3.20130502.biallelicsnps.integrated.sites.vcf -selectType SNP -selectType MNP -restrictAllelesTo BIALLELIC ............................... ............................... INFO 10:43:36,934 ProgressMeter - 21:32804795 6.75401394E8 34.5 m 3.0 s 90.7% 38.0 m 3.5 m INFO 10:44:06,936 ProgressMeter - 22:38919126 6.8447717E8 35.0 m 3.0 s 92.5% 37.8 m 2.8 m WARN 10:44:26,933 RestStorageService - Error Response: PUT '/ZfGg9SNeuIpdIumzexlHAQZig40nfm0z.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1061, Content-MD5: 201Aq8dwhl0j5fJGHKRP0Q==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: db4d40abc770865d23e5f2461ca44fd1, Date: Wed, 08 Jul 2015 02:44:25 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:9h2p7rrbmckoIZny2zI+E3bpGAI=, User-Agent: JetS3t/0.8.1 (Linux/3.11.0-26-generic; amd64; en; JVM 1.7.0_55), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 26A9B7914689A8F5, x-amz-id-2: WS4wp1NQct07M27rc8EUappYwQ9OF0RMJmEtPJl4pbeQZBgL71Ts3pTS0j/h9vmGVFALomFrCG0=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 08 Jul 2015 02:25:44 GMT, Connection: close, Server: AmazonS3] WARN 10:44:27,927 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately -1122 seconds. Retrying connection. INFO 10:44:29,256 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

java.lang.IllegalStateException: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176) at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115) at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222) at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182) at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:271) at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:774) at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:288) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

ERROR ------------------------------------------------------------------------------------------

Please help! Am I doing this right?

From Sheila on 2015-07-08

@rose_ismet

Hi,

This could be a bug. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/3962/genotypeandvalidate-error-key-callstatus-found-in-variantcontext-field-info

Can you please post the vcf record at position X:7151130?

Thanks,

Sheila

From rose_ismet on 2015-07-09

Hi Shiela,

Thank you for the promp reply. I’ll try to read through the forum to gather more info on this issue.

At the meantime, here’s the vcf record for the position requested.

X 7151130 . TT TC 100 PASS AC=1;AF=0.000264901;AN=3775;NS=2504;OLD_VARIANT=X:7151131:TC/CC/C;DP=13746;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;EAS_AF=0.001;VT=MNP

Anything out of the ordinary there?

From MUHAMMADSOHAILRAZA on 2015-07-09

Hi, My VCF file contain both INDELs and SNPs, Is that genotype refinement can only be applied on SNPs but not INDELs present in the file?

I am sorry for my stupid question.. But i am confused. because —supporting file (ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf) contain snps.

Thanks!

From Sheila on 2015-07-09

@rose_ismet Hi,

I am not sure why you are having this error if you downloaded from the 1000Genomes website. I just downloaded the X chromosome vcf and it has OLD_VARIANT defined in the header.

The reason you are getting the error is because OLD_VARIANT is not defined in the vcf header. You can simply add in a line to define it yourself, and that should solve the problem.

-Sheila

P.S. The header line from the vcf I downloaded is ##INFO=

From Sheila on 2015-07-09

@MUHAMMADSOHAILRAZA

Hi,

Yes, as of now, the genotype refinement workflow works on SNPs only.

You don’t need to select for SNPs only. You can simply input your recalibrated vcf with everything.

-Sheila

From MUHAMMADSOHAILRAZA on 2015-07-10

Sheila thank you…!

From MUHAMMADSOHAILRAZA on 2015-07-13

Hi,

After following all the steps i finally got the “recalibratedVariants.postCGP.Gfiltered.deNovos.vcf” file. Is there any way in GATK that i can extract and filter only “hiConfDeNovo” (high confidence De novo mutation)and “loConfDeNovo” (low confidence de novo mutations) mutations?

i just utilized the grep -w “hiConfDeNovo” recalibratedVariants.postCGP.Gfiltered.deNovos.vcf >> hiConfDeNovo.vcf to extract the corresponding records, is there another good way?

Thanks!

From MUHAMMADSOHAILRAZA on 2015-07-13

My another question is:

If we Phased the VCF file by PhaseByTransmission (i.e. after step 2, before using GenotypeAnnotator), how the tool effect on Genotypes information, and is that affect lator de novo mutation discovery in step 3 ? and if we utilize other GATK tools for downstream analysis, are they automatically recognize updated GQ values and ignore PLs?

From MUHAMMADSOHAILRAZA on 2015-07-14

Hi Sheila Geraldine_VdAuwera

I am waiting for your kind reply…..

From Sheila on 2015-07-14

@MUHAMMADSOHAILRAZA

Hi,

For your first question, you can use Select Variants. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

For your second question, this thread should help: http://gatkforums.broadinstitute.org/discussion/5573/some-queries-on-phasebytransmission-output

The GQ field is updated in the output VCF, so it will be recognized by downstream tools. Because the PLs are not changed, the downstream tools will use the original PLs. As for ignoring PLs, it depends on which tools you use. If the PLs are taken into account, the original PLs will be used.

I hope this helps to clarify things.

-Sheila

From MUHAMMADSOHAILRAZA on 2015-07-15

@Sheila

Do PhaseByTransmission tool use GQ info?

From rose_ismet on 2015-07-22

Hi Shiela,

Thanks for the tips. Finally managed to get the mising header line into the file that I have. Also managed to seperate out the biallelic SNPs.

Now my issue is that I have used the hg19 as a referance. And i believe that the SNPs from Phase 3 of the 1000 Genomes project are of a GRCh build. I can’t seem to be able to insert ‘chr’ to the chromosomes. I wonder if GATK provide this referance in hg19 format. or would there be any other alternative for analysis done using hg19 to derive posterior probabilities?

Coming from a biological background you can imagine the dilleme I am having.

Thank you for your time.

From Sheila on 2015-07-24

@MUHAMMADSOHAILRAZA

Hi,

Sorry for the late response! Yes. Phase By Transmission uses GQ.

-Sheila

From Sheila on 2015-07-24

@rose_ismet

Hi,

You can use Liftover Variants to fix your problem. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_LiftoverVariants.php

-Sheila

From nchuang on 2015-07-29

@Sheila

I also am getting the OLD_VARIANT error. It is strange because running SelectVariants on dbsnp142 and 1kG_phase3.snps vcfs I didn’t get this error. I only got it trying to use CombineVariants on them.

I found the entry, it was in the 1kG_phase3.snps.b37.vcf. I think I got that from the resource bundle. I will try inserting the header.

Nelson

From Sheila on 2015-07-31

@nchuang

Hi Nelson,

You can either add the OLD_VARIANT to the header, or check again in the resource bundle for 1000G_phase3_v4_20130502.sites.vcf. The file was recently added and should work properly.

-Sheila

From ahmed_chakroun on 2015-11-22

Hi,

Just to point out a mistake on the Step 2 output file description. Indeed, at the end of the Step 2 your can read “Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ in the FILTER field.”. However, -G_filter through “VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record’s FILTER tag).”

Cheers

Ahmed

From Sheila on 2015-11-23

@ahmed_chakroun

Hi Ahmed,

I am afraid I don’t understand your post. What exactly needs to be changed in the document? Posting some before and after records might help too.

-Sheila

From Geraldine_VdAuwera on 2015-11-26

Thanks for reporting this, @ahmed_chakroun. You are correct, we meant the “genotype filter field” which of course corresponds to the FT tag in FORMAT, not the site-level FILTER field. I’ll fix that now.

From gkphilip on 2016-02-18

Hello,

I’m trying to lift over the 1000G_phase3_v4_20130502.sites.vcf from the resource bundle to hg19 with Picard 2.1.0

using the command: java -jar /usr/local/picard/2.1.0/lib/picard.jar LiftoverVcf I=1000G_phase3_v4_20130502.sites.vcf O=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf R=hg19/ucsc.hg19.fasta.

I get the following error:

[Thu Feb 18 12:28:24 EST 2016] picard.vcf.LiftoverVcf INPUT=1000G_phase3_v4_20130502.sites.vcf OUTPUT=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf REFERENCE_SEQUENCE=hg19/ucsc.hg19.fasta WARN_ON_MISSING_CONTIG=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

[Thu Feb 18 12:28:24 EST 2016] Executing as gkphilip@merri on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_20-b26; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater

INFO 2016-02-18 12:28:24 LiftoverVcf Loading up the target reference genome.

INFO 2016-02-18 12:28:40 LiftoverVcf Lifting variants over and sorting.

ERROR 2016-02-18 12:28:40 LiftoverVcf Encountered a contig, chr1 that is not part of the target reference.

[Thu Feb 18 12:28:40 EST 2016] picard.vcf.LiftoverVcf done. Elapsed time: 0.26 minutes.

Runtime.totalMemory()=5031067648

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

I have checked that chr1 is in the target reference (it was downloaded from the resource bundle), and have tried searching the forum for similar issues but don’t seem to find any. Is it possible to liftover this file, or do you have any other suggestions on how I can overcome this issue?

Many thanks for your help.

From Sheila on 2016-02-23

@gkphilip

Hi,

I think your question is being answered [here](http://gatkforums.broadinstitute.org/gatk/discussion/7009/picard-liftovervcf-contig-not-part-of-the-target-reference/p1).

-Sheila

From hujingchu on 2016-08-04

hi,

I have a trio(parents and a child) exome data which is going to call de novo variants, what i have done is following the best practices to get variants, during the filtering step, SNPs were using a VQSR, but Indels were using a hard filter since insufficient amount,.

before running CalculateGenotypePosteriors, both VQSR snps and hard filter indels were cat together.

the question is, after CalculateGenotypePosteriors, snps were all removed from output without any warnings, everything is ok if only run on SNPs set.

From Sheila on 2016-08-04

@hujingchu

Hi,

What do you mean “after CalculateGenotypePosteriors, snps were all removed from output without any warnings”? Can you post the exact command you ran?

Thanks,

Sheila

From helene on 2017-03-21

Hello,

I found that the 1000G_phase3_v4_20130502.sites.vcf file under b37 folder in your bundle. No such file in hg19 folder. Since my reference genome has been hg19, does that mean I would need to realign my files? Thank you.

From shlee on 2017-03-24

Hi @helene,

I believe our resource bundles are provided as is. If the hg19 folder doesn’t contain the resource, then please check with the 1000 Genomes Project to see if they have the equivalent.

From zejunyan on 2017-04-19

Hi all,

I am trying to run a trio containing 3 samples (dad, mum and one child) with the aim to identify mutations in child. I managed to get the final vcf file: recalibratedVariants.postCGP.Gfiltered.deNovos.vcf. Based on the statement that high confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion, I filtered out these callsets with GQ >=20 in each trio sample. However, there is still more than 0.5 million snps as potential mutations, which cannot be true. I am looking for less than 100 snps as mutations.

Am I doing something wrong? Could anyone give me some suggestions?

Bests

Dr yan

From Sheila on 2017-04-20

@zejunyan

Hi Dr. Yan,

Are you seeing 0.5 million high confidence de novo mutations when you expect to see ~100? If so, can you please post the exact commands you ran to produce the final VCF with de novos?

Thanks,

Sheila

From Geraldine_VdAuwera on 2017-04-20

Would also be good to check for sample swaps, and verify paternity…

From zejunyan on 2017-04-21

@Sheila

Yes, I see a lot more than expected. The command-lines are :

java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -T CalculateGenotypePosteriors —supporting ../../../../snp_db/vcf_chr_1-28_30_32.vcf.gz -ped trio.ped -V genotyped.trio.cohort.g.vcf -o recalibratedVariants.postCGP.vcf

java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantFiltration -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.vcf -G_filter “GQ < 20.0” -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf

java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantAnnotator -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf

I used hard-filtered snp (genotyped.trio.cohort.g.vcf) as input.

I followed the documentation.

Is there something wrong with these three command-lines??

Thank you very much

Dr yan

From Sheila on 2017-04-25

@zejunyan

Hi Dr. Yan,

No, those look fine. Have a look at Geraldine’s suggestion above as well.

-Sheila

From Vergilius on 2017-08-03

I was wondering whether I have a Joint analysis of X samples containing Y trios. Should I run a genotype refinement for each trio (using a single ped for the single family) or should I run the GRW for all the samples together and using in input a general PED file contianing information of all the families?

From Sheila on 2017-08-16

@Vergilius

Hi,

I think you can input a general ped file containing information for all trios.

-Sheila

From Vergilius on 2017-08-19

@Sheila

Ok! It’s what I did. Looks to work fine. Thanks

From mcvu on 2017-11-20

Hello,

I am working with hg19 reference, and so the contigs in 1000G_phase3_v4_20130502.sites.vcf are not compatible. Will it make much difference to use 1000G_phase1.snps.high_confidence.hg19.sites.vcf from the bundel? Considering it’s phase 1 not phase 3?

Thanks!

From Sheila on 2017-11-27

@mcvu

Hi,

Yes, that is the file you should use. The 1000G_phase3_v4_20130502.sites.vcf file is for use with b37 reference (which is what we use in our example commands) :smile:

-Sheila

From ccs76 on 2018-11-22

Hello

Thanks so much for those workflows :)

I’m pretty new in the materia, if I understand right this workflow is for VQSR filtering, which one is in case of apply hard filtering? Im interested on call the novo mutations in my trio.

Thanks

From ccs76 on 2018-11-22

Hello again

I have another question about de novo variants callers and somatic caller (atm no one answered me in the forums), maybe this is not the right place and its weird what I do. If I merge the parental bam files and I consider them as normal cell against the child (consider as tumor) those somatic variants results shouldnt be similar as if I use directly the bam files from the parents and child with de novo trio callers tools? In my case I used VarScan tools. Im trying to learn now using GATK.

Thanks

From lianman on 2019-03-22

Hello,

Does this workflow works on INDELs Now? When i did the second step, i got this error:

ERROR MESSAGE: Invalid argument value '<' at position 8.

ERROR Invalid argument value '20.0"' at position 9.

and this is my command java -Xmx16g -jar /home/mgujral/tools/GATK/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T VariantFiltration -R /oasis/projects/nsf/ddp195/dantakli/reference/GRCh38fullanalysissetplusdecoyhla.fa -V /home/a1lian/recalibratedVariants.postCGP.vcf -Gfilter "GQ < 20.0" -GfilterName lowGQ -o 11000.SSC02220.Gfiltered.vcf

From ccs76 on 2019-08-22

Hello

Im pretty confuse, hope someone can help me. I had done alignment and call variants using GATK tutorials and reference genome from ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz is there any way to get —supporting 1000G_phase3_v4_20130502.sites.vcf compatibility with my RF or I need to start over again using the human_g1k_v37_decoy.fasta ?

Thanks

Report abuse