created by Geraldine_VdAuwera
on 2013-06-17
Objective
Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.
Prerequisites
Steps
If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.
-ploidy
)In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy
to Number of samples in each pool * Sample Ploidy
. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).
-glm
)This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP
, but it can also be set to INDEL
or BOTH
. If set to BOTH
, both SNPs and Indels will be called in the same run and be output to the same variants file.
-stand_emit_conf
)This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.
-stand_call_conf
)This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.
The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R haploid_reference.fa \ -I haploid_reads.bam \ -L 20 \ -glm BOTH \ --stand_call_conf 30 \ --stand_emit_conf 10 \ -o raw_ug_variants.vcf
This creates a VCF file called raw_ug_variants.vcf
, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.
Note that -L
specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes. For example, when running on exome data, we use -L
to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.
Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.
Updated on 2016-02-11
From BerntPopp on 2014-04-22
Hey Geraldine,
currently I am experimenting with software for rare variant detection in pooled WES samples. GATK works fantastic for SNVs with ploidy 24 but can not find some (real) indels although one can see the reads in IGV and the indel-reads at these positions represent a similar fraction of the whole coverage at that position when compared to SNVs.
I am using “-glm BOTH” and also I tried reducing “-minIndelFrac” and “-stand_call_conf” / “-stand_emit_conf”.
SNVer 0.5.3 calles most of these indels.
Do you have any other suggestions?
From Geraldine_VdAuwera on 2014-04-22
Unfortunately UG is just not very good at calling indels. HaplotypeCaller is much better; we are working on adapting it to call variants in non-diploid data. In the meantime however there’s not much I can say since it seems you are already doing the right thing, lowering the conf thresholds and the min indel fraction. I can’t think of anything else that would help, sorry.
From BerntPopp on 2014-04-22
Thanks Geraldine,
interestingly HC also calls most of the SNVs in the pooled sample although it has no ploidy option yet. But it does also not emit the indels.
Do you know when the ploidy option will be added to HC?
Cheers,
Bernt
From Geraldine_VdAuwera on 2014-04-22
It’s hard to give a timeline because this involves some pretty big modifications to get rid of the ploidy assumptions and generalize the model. The first milestone is going to be to make HC able to deal with haploid organisms, and I expect that will be a matter of weeks. Full polyploidy support is probably going to be a matter of months.
From Geraldine_VdAuwera on 2015-05-08
Note that the HaplotypeCaller is now capable of calling non-diploid organisms using the same ploidy option (since GATK 3.2).
From BobHarris on 2016-02-09
FYI, for other novices like myself, GATK (version 3.5) appears to reject the double-dashed version of the options glm, stand_call_conf and stand_emit_conf.
Another issue with the example command in this tutorial is “-L 20”. UnifiedGenotyper’s —help provides no information about this other than that 20 must be an “intervals”. Upon running the command, an error message clued me into the fact that “20” is expected to be, I guess, the name of a chromosome or contig in my reference fasta. I was trying to run a simple test with a reference that only has three short sequences named chr1, chr2, and chr3. For now I’ve just removed that option; probably I will need to understand it in the future, and I hope I’ll locate something that documents it.
From Sheila on 2016-02-10
@BobHarris
Hi,
Yes, you are correct that -L 20 tells the program to run strictly on chromosome 20. Have a look at the tool documentation for more information on what the arguments mean.
The tool documentation is [here](https://www.broadinstitute.org/gatk/guide/tooldocs/). You can find all the argument information for all the tools there.
-Sheila
From Geraldine_VdAuwera on 2016-02-10
Most of these tutorials were derived from a single full tutorial where the purpose of `-L 20` was explained at the beginning, but we omitted that when we broke it down into separate tutorials. We’ll try to address this in the near future since it seems to be a common problem.
From BobHarris on 2016-02-11
Thanks, S and G. Once I proceeded to the HaplotypeCaller tutorial, I noticed some additional info there that pointed me to article 4133, which is very helpful.
I understand that -L 20 was used in this example so that the command will produce results quicker than it would for the whole (presumed human) genome. It does assume, though, that the genome has a chromosome 20 and that the sequences are named numerically.
As I noted in the other HaplotypeCaller thread, these tutorials are very helpful. My main reason for posting is in case someone else runs into the same issues I did.
From bbsunchen on 2016-03-20
Is there any documents that I can read about the detail of HC and UG’s algorithm, so that I can get a clear and detail idea what is the difference between them?
From Sheila on 2016-03-22
@bbsunchen
Hi,
Answered [here](http://gatkforums.broadinstitute.org/gatk/discussion/7268/in-term-of-algorithms-what-is-the-difference-between-haplotypecaller-and-unifiedgenotyper#latest). No need to post twice as we aim to answer every question that comes through regardless of where they are posted.
-Sheila