Purpose and operation of Readbacked Phasing

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by delangel

on 2012-07-23

This document describes the underlying concepts of physical phasing as applied in the ReadBackedPhasing tool. For a complete, detailed argument reference, refer to the tool documentation page.

Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in -ERC GVCF or -ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary unless you want to merge consecutive variants into MNPs.

Underlying concepts

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.

How it works

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 chr20 332341 rs6076509 A G 470.60 PASS AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57 GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99 chr20 332503 rs6133033 C G 726.23 PASS AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75 GT:DP:GL:GQ:PQ 1|0:60:-110.83,-18.08,-149.73:99:126.93

The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. For cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Note that this tool can only handle diploid data properly. If your organism of interest is polyploid or if you are working with data from pooling experiments, you should not run this tool on your data.

More detailed aspects of semantics of phasing in the VCF format

The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.
Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.
A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.
RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).
If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 chr1 1 . A G 99 PASS . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 2 . A G 99 PASS . GT:GL:GQ:PQ 1|1:-100,0,-100:99:60 0|1:-100,0,-100:99:50 chr1 3 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:60 0|0:-100,0,-100:99:60 chr1 4 . A G 99 FAIL . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 5 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:70 1|0:-100,0,-100:99:60 chr1 6 . A G 99 PASS . GT:GL:GQ:PQ 0/1:-100,0,-100:99 1|1:-100,0,-100:99:70 chr1 7 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:80 0|1:-100,0,-100:99:70 chr1 8 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:90 0|1:-100,0,-100:99:80

The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

AGAAA GGGAG

And two haplotypes at positions 6-8:

AAA GGG

And, SAMP2 has the two haplotypes at positions 1-8:

AAAAGGAA GGAAAGGG

Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.

Updated on 2015-05-15

From shawpa on 2013-06-07

According to what is written above, the readbackphasing doesn’t support insertions or deletions. If my vcf file contains indels, should I remove them before running the readbackphasing command. I did run this command without removing indels from my vcf file and I got phasing information from the indel lines in the vcf. Can I not trust these?

From Geraldine_VdAuwera on 2013-06-11

I believe the expected behavior for this tool is to ignore indels — I’ve asked the author of the tool (@fromer) to jump in here and give a definitive answer. Can you maybe post the lines with the indels that you say were phased?

From fromer on 2013-06-12

For now, the RBP algorithm does not include indels and considers only SNPs. In fact, it should treat indels and non-PASS (failing filter) SNPs the same way — it acts as if they’re not there in the VCF file.

Therefore, when looking at the output, you need to remember that RBP tries to phase successive SNPs as if no indel was in between them (so it might look like the indel in between is phased relative to the next SNP).

We’ve actually developed a more robust syntax for denoting phase that will cover situation such as these.

It’s important to remember that currently the “phased relative to” relationship is implied by “|” as being phased relative to the previous PASS-ing biallelic SNP.

From jorei499 on 2013-08-27

Is this really correct?

>“The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

>“1. AGAAA”

>“2. GGGAG”

I would interpret that pos 4 since it is 0/1 can not be phased with pos 3. Have I missed something or is it wrong in the example?

From amy on 2014-04-17

About the phased output vcf file, is that possible to require other FORMAT like AD and DP? I am interested in getting the allelic counts. Thank you :)

From HeidiJTP on 2014-04-17

I have the same question as jorei499, this example does not make sense to me. I am still struggling to understand the vcf format. I ran readbackedphasing on 100 samples (sequenced from PCR amplicons, each read covers all variable sites). I find in every case that the first heterozygous site is unphased. I can open the fasta alignment and easily see what the phasing should be. Is this normal? How do I interpret this?

205_phased.txt

300_phased.txt

From Geraldine_VdAuwera on 2014-04-17

@amy I’m sorry but I don’t understand your question, can you please clarify? Do you have a VCF file that does not have AD and DP for the samples?

From Geraldine_VdAuwera on 2014-04-17

@HeidiJTP‌, have a look at the part of the doc that says:

> Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a “/” – otherwise, they would be the continuation of the preceding haplotypes.

From HeidiJTP on 2014-04-17

Thanks so much for always responding quickly Geraldine. Does this mean that there is no way to determine the phase of the first heterozygous site using this method?

From Geraldine_VdAuwera on 2014-04-17

@HeidiJTP‌ The key here is that all the other sites after it are phased relative to that first site — so by definition, it is phased (arbitrarily). I agree that the notation is ambiguous and confusing, but that’s the format…

From HeidiJTP on 2014-04-18

Ok thanks. I wasn’t sure because the sample I was looking at had previously been typed differently, but I believe the issue was specific to that sample. Thanks for clearing that up for me!

From amy on 2014-04-22

> @Geraldine_VdAuwera said:

> amy I’m sorry but I don’t understand your question, can you please clarify? Do you have a VCF file that does not have AD and DP for the samples?

The phased VCF file is the output form Read-backed phasing which only includes format GT:GL:GQ:(PQ) (just like the example in this page). My question is that can I ask other FORMATs like AD and DP? I would like to know the allelic counts. Thank you Geraldine!

From Geraldine_VdAuwera on 2014-04-22

Hi @amy,

I think the latest version should output those fields, if you started out with a VCF produced by GATK. Let me know if you find that’s not the case.

From bvecchio on 2014-04-23

Hi Geraldine,

I am continuing my analysis of re-sequencing data for a 200kb region in 600 unrelated samples. I have successfully used the ReadBackedPhasing tool and believe I will be able to take the output and feed it into BEAGLE for additional population based phasing of INDELs (which are not phased via the read based algorithm). I am wondering if you know of a tool available which will generate a consensus sequence for phased regions, or that will easily allow determination of phase of SNPs in a given region. For example, I’d like to query the phased VCF output of BEAGLE for a given snp, SNP A, and determine if it was found on the same haplotype has SNP B in a given individual (without delving into the VCF by hand in a text editor).

As a side note, I was also able to use the VariantstoBinaryPed tool on our dataset for bi-allelic sites and wrote a code to convert multi-allelic sites into psudomarkers. However, the tool did have some trouble with the sex chromosomes, so I opted to remove them for the time being.

Thank you for your time,

-Briana

From Geraldine_VdAuwera on 2014-04-23

@bvecchio‌ There is no tool in GATK that will do exactly what you want. You can have the HaplotypeCaller output haplotype sequences using the -bamOut argument, but it doesn’t actually use the phasing info from VCFs. Starting from a VCF you can use AlternateReferenceMaker to output consensus sequence, but that will output IUPAC ambiguity codes for heterozygous sites, and will not use phasing info either. I have heard of people writing their own scripts to do what you want but I can’t recommend any as I haven’t used them myself.

Sex chromosomes tend to be problematic…

Good luck!

From amy on 2014-04-24

Hi Geraldine,

I started with a VCF file who do have FORMAT AD and DP. I tried the latest version 3.1-1 to do Read-backed phasing and things became a little complicate. First, I used a subset of the data, it works, but still do not have AD and DP. Then I tried all my data, it gave me “ A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8): ….. This might be a bug……MESSAGE: 255”. Before, with exactly the same command, it worked in version 2.8.

Do you think if I can use VariantAnnotator to get AD and DP? (e.g. -T VariantAnnotator -R xxx.fasta -I xxx.bam -o new.with.AD.DP.vcf -A DepthPerAlleleBySample -A Coverage —variant output.from.readbackedphasing.vcf -L interval.bed)

Thank you very much :)

From Geraldine_VdAuwera on 2014-04-24

Hi @amy,

Yes, I believe that should work.

Can you post the full text of the error you got when you tried with all the data? Including all the log output please?

From amy on 2014-04-25

Hi Geraldine,

Thank you for the answers:)

Some more naive questions about AD and DP: I know the difference between them is filtered or not, however, if I sum all AD, should that equal to the total number of reads (passed the general filter) in the .bam file? (In my results, they are different, 25% less in sum AD.) What exactly is the filter who make difference between AD and DP? If the .bam file comes from RNA-seq, do you think AD can represent allelic expression ?

Full text log output is attached.

Thank you for the help.

error.PNG

From fern on 2014-05-20

Dear Authors

Dose this phasing in the tool consider the paired end reads? If the one read of the pair shows variant “A”, after some distance, the other read of the pair exhibits variant “G”, rather than “T”. Then, the haplotype is AG, rather than AT.

From Geraldine_VdAuwera on 2014-05-20

@fern Yes, read pair information is used in the phasing process.

From zzq on 2016-01-25

Hi delangel Geraldine_VdAuwera ,

Very nice explanation for beginners. For me, I have a vcf obtained by GATK (version 3.5) HaplotypeCaller running in `-ERC GVCF` and then using `CombineGVCFs and GenotypeGVCFs`. You said that post-processing variant calls with ReadBackedPhasing is no longer necessary, but in my VCF file I can not see a genotype contains phased allele separator `|`. Does this need the further processing using GATK, if yes, which command can do this?

I want to phase this vcf file using beagle. I think, pre-phased using GATK and then pass it to beagle which will preserve the phase of the genotype during the analysis when a genotype contains separator `|` will give us a more accuracy result.

Many thanks.

best wishes!

From Sheila on 2016-01-25

@zzq

Hi,

With the latest version of GATK, HaplotypeCaller should do the phasing automatically. Can you please post the exact command you ran?

Thanks,

Sheila

From zzq on 2016-01-26

Hi @Sheila ,

The commands I ran like following,

for each sample,

`java -Xmx50g -jar GenomeAnalysisTK-3.5.jar \ -R ref.fa \ -T HaplotypeCaller -nct 8 \ -I $d.sorted.uniqe.rg.dedup.realn.bam \ -o $d.g.vcf \ —genotyping_mode DISCOVERY \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -ERC GVCF

done`

then I will combine these GVCFs and genotype

`java -Xmx50g -jar GenomeAnalysisTK-3.5.jar \ -T CombineGVCFs \ -R ref.fa \ —variant gvcf1.gz \ —variant gvcf2.gz \ —breakBandsAtMultiplesOf 10000 \ -o combine.gvcf.gz`

`java -Xmx50g -jar GenomeAnalysisTK-35.jar \ -T GenotypeGVCFs -nt 16 \ -R ref.fa \ —variant combine.gvcf.gz \ -o combine.vcf`

I hope these information can help you with my problem.

Thank you!

From Sheila on 2016-01-26

@zzq

Hi,

Okay, Thanks. Can you please post screenshots of the original bam file and [bamout file](https://www.broadinstitute.org/gatk/guide/article?id=5484) of a region you think should be phased? Please also post the corresponding VCF records.

Thanks

Sheila

From npontikos on 2016-01-29

I don’t see the `|` symbol in my output.

However I have:

```

GT:GQ:HP 0/1:99:17690409-1,17690409-2

GT:GQ:HP 0/1:99:17690409-2,17690409-1:1258.14

```

Which I interpret as:

```

0|1

1|0

```

Correct?

From Sheila on 2016-01-29

@npontikos

Hi,

No, the genotypes that are phased will have the | in them. Can you please post the exact command you ran to get the phased output? Also, how did you generate the VCF you are using?

Thanks,

Sheila

From npontikos on 2016-01-29

@Sheila

Command:

```

java -jarGenomeAnalysisTK.jar -T ReadBackedPhasing -R human_g1k_v37.fasta -I A.bam —variant A.vcf -o SNPs_phased.vcf.gz —phaseQualityThresh 20

```

The VCF comes from GenotypeVCFs

From Sheila on 2016-01-29

@npontikos

Hi,

Thanks. Can you try setting the output file to a .VCF file instead of a .gz file? We have had some reports of issues with the .gz output.

-Sheila

From fromer on 2016-01-29

As the original author of the RBP tool, I’ll jump in briefly.

The interpretation of:

GT:GQ:HP 0/1:99:17690409-1,17690409-2

GT:GQ:HP 0/1:99:17690409-2,17690409-1:1258.14

is indeed as @npontikos takes it to mean:

The second site’s alternate allele (1) is on the same physical haplotype as the first site’s reference allele (0), and vice versa [second site’s 0 goes with first site’s 1]. This is based on the fact that the HP pairs line up in reverse order between these two genotypes.

And, indeed, in the old notation that RBP used to output, this would have been:

0/1

1|0

The reason we changed this is for multiple reasons of (ambiguity, incompleteness, possible inconsistency with trio-based phasing), where the HP tag more explicitly links up alleles at (perhaps non-consecutive) genotypes of the same sample.

From npontikos on 2016-01-31

ok thank you @fromer makes sense

From scastel on 2016-02-02

Since it seems like the documentation on this post is out of date, would it be possible to get either a manual page or an update to this page that reflects what the current output of the tool is?

From scastel on 2016-02-02

Meant to point out that the manual page does not detail the output at all. It simply says “Output – Phased VCF file”, which is not helpful.

From Sheila on 2016-02-02

@scastel

Hi,

Unfortunately, we have not spent much time documenting ReadBackedPhasing, as HaplotypeCaller now does phasing. The reason we still have ReadBackedPhasing is for users to merge MNPs. Hopefully, in the near future, HaplotypeCaller (or a new tool) will merge MNPs. For now, I will make a note to update this article.

-Sheila

From jason.harris on 2016-02-02

@Sheila To be fair, HaplotypeCaller only phases if using GVCF or BP_RESOLUTION. If we are emitting variants only, then we still need to use ReadBackedPhasing to phase SNPs, correct? So merging MNPs is not the only remaining use case for RBP.

From Geraldine_VdAuwera on 2016-02-02

The GVCF workflow is now the recommended best practice workflow; other use cases are less well supported because we have to prioritize our efforts.

From zzq on 2016-02-08

Hi @Sheila,

Sorry for the late reply. I found the VCF file produced by above commands will have a tag `PGT`. I think this will be useful for me. I should instead the `GT` using the `PGT` and then pass it to beagle. Do you think this will give me a more accuracy result ?

Many thanks!

From Sheila on 2016-02-08

@zzq

Hi,

I think [this page](https://www.broadinstitute.org/gatk/blog?id=4741) will help with information on the phasing annotations.

-Sheila

From MarisaMH on 2016-05-02

So following @npontikos question, how do I know which one is the first variation of an haplotype. I am assuming that:

GT:HP:PQ 0/1:28735467-1,28735467-2:3.01

GT:HP:PQ 0/1:28735467-2,28735467-1:3.01

GT:HP:PQ 0/1:29638368-1,29638368-2:3.01

Is explained like:

0/1

1|0

0/1

0|1

And not

0/1

1|0

0|1

Am I right?

From Sheila on 2016-05-02

@MarisaMH

Hi,

Yes, you are correct.

-Sheila

From MarisaMH on 2016-05-11

I have a question about the —variant option in the read backed phasing algorithm. I have a VCF file with some phased and some unphased sites, and I tried GATK for recover some phasing. But the resulting VCF, has more SNPs than the original. How is that possible? How does the —variant option actually works? Because I’ve read that the VCF input can be empty but for the headers.

EDIT: Solved

From Sheila on 2016-05-11

@MarisaMH

Hi,

I am happy you solved the problem on your own! :smile:

It would be nice if you could post your solution here so others can benefit as well.

Thanks,

Sheila

From everestial007 on 2016-08-11

npontikos Sheila

I went over the several Q/A on this forum but still have some questions.

1) So, if I still want to obtain vcf with a pipe (0|1) for the haplotypes after RBphasing, what are my options?

2) Also, I am not understanding how HP hags helps us identify which allele goes with which allele in the variants close to each other.

Mainly, I also don’t undertand the numbers 1 & 2 on HP tag, the number before the dash (-) is the variant position. But, how does -1, -2 help with which allele goes with which.

GT:AD:DP:GQ:HP:PL 0/1:54,88:142:99:336768-1,336768-2:3456,0,1971

GT:AD:DP:GQ:HP:PL 0/1:24,25:49:99:337443-1,337443-2:724,0,751

GT:AD:DP:GQ:HP:PL 0/1:22,25:47:99:337480-1,337480-2:702,0,637

Thanks,

From Geraldine_VdAuwera on 2016-08-11

@everestial007 There are no options for generating pipe-phased genotypes using RBP. However, if you call variants with HaplotypeCaller in GVCF mode, that will produce pipe-phased genotypes in the PGT field.

For the rest, we’ve covered the output of the RBP tool many times in the forum and in the tool documentation at https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php

From everestial007 on 2017-01-31

@Geraldine_VdAuwera :smile:

I am revisiting this old problem, once again.

What is the reason for providing the phase information using RBP vs. GVCF differently?

Phased after using GVCF:

`GT:AD:DP:GQ:PB:PC:PGT:PID:PL:PW 0/1:88,8:96:58:.,.:1.0:0|1:12237_G_C:58,0,3778:0|1`

Using RBP: `GT:GQ:HP 0/1:99:17690409-1,17690409-2` `GT:GQ:HP 0/1:99:17690409-2,17690409-1:1258.14`

Also, how can I start dicussion on GATK forum? Looks like I am not able to post any discussions. Can you please look into the problem?

From Geraldine_VdAuwera on 2017-02-01

Hi @everestial007,

To be frank, quoting the author of RBP, Menachem Fromer, “The reason is that we have not yet bothered to sync up the formats, presumably changing the RBP implementation to be the same (or similar to that in HC).”

It just hasn’t been a priority… but we’d happily take a pull request if someone wants to take this on!

Re: your posting problem, if you want to start a new thread you need to do it from the “Ask the team” section of the forum. There you will see a “Question” button in the left-hand menu. Let me know if that doesn’t work for you.

From everestial007 on 2017-04-06

Geraldine_VdAuwera Sheila @shlee : hope you are doing well. I wanted to add some tips based on my experience with phased variants data and also following on the conversation on http://gatkforums.broadinstitute.org/gatk/discussion/8686/phasing-via-haplotypecaller-vs-readbackedphasing

I think the way phasing is represented in a vcf file should be universal.

- My personal opinion is that tags ouput from RBP are little confusing (at least to the empirical biologist).

- Phase states from HC (haplotype caller) are way way better but I can personally tell the mining these phase states using pyVCF module (https://github.com/jamescasbon/PyVCF) which is most popular python module to work on vcf files will run into a problem – reason being the `PID tag` is represented like `12237_G_C` (a number and the starting alleles). So, there is an extra step involve in splitting the tag value and recollecting the integer value. Another issue which I am not sure about is, if these integer value is duplicated in other haplotype blocks.

- I have been working on phasing for sometime now and my personal experience tells that tags represented using pHASER https://github.com/secastel/phaser/tree/master/phaser is a better way to go, where you have a `phased local block genotype i.e PG` and `phased local block index i.e PI`. This keep the information simple and is also easily mined using `pyVCF` without having to add any tricks.

To add at the end: I have written a tool on python which takes the partially phased (ReadBackPhased) haplotypes (blocks) and stitches them to create a genome wide haplotype. For now it specifically works on hybrids but other upgrades may be coming soon. I am hoping this will be helpful to others who are having similar issues. https://github.com/everestial/pHASE-Stitcher

Thanks,

From Sheila on 2017-04-11

@everestial007

Hi,

Thanks for sharing.

I am not sure I understand “if these integer value is duplicated in other haplotype blocks”? Do you mean is the position repeated in more than one PID? If so, yes. The PID contains the first site in the phased sites. For example, if sites 1,2,3 are phased, the PIDs for all the 3 sites will contain 1.

I hope that clarifies things.

-Sheila

From Geraldine_VdAuwera on 2017-04-25

Thanks for sharing these observations, @everestial007. I’m not sure what is the current thinking re: representation of phasing in the field (it’s not just up to us) but our methods dev we’re just talking about PHASER in their journal club yesterday so I’m sure they’ll have some opinions. I’ll circulate your post among them and see if that stimulates some proposals to improve what we currently emit.

From everestial007 on 2017-04-26

@Sheila, I was actually wondering if that numeric values is unique among all the haplotypes generated for that sample. So, no other haplotype will have that numeric value, rite?? Because if there is a duplicate mining the vcf will generate two haplotype with same ID. I am sure that uniqueness is the case. I will have to go back to my data and check to verify this.

@Geraldine_VdAuwera, yes PHASER seems to be having a good way of representing the haplotype data and it block value in a very nice way. With haplotype generation (Not just the SNPs, etc) being the next big step in next gen data genome representation, I think there is need to have a very unique and universal way of representing this information. Let me know of any updates. Thanks,

From Sheila on 2017-04-27

@everestial007

Hi,

That is correct the IDs are unique for each haplotype.

-Sheila

From YingLiu on 2017-04-28

@fromer

do you have developed a indel phase method ?

and could you please provide me with detailed alrigthm method ?

thank you !

From everestial007 on 2017-04-30

@YingLiu : If you do phasing not using the RBP tool, but call vairants using GVCF method you will get the phased SNPs and indels.

See the output below for one of my sample: You can see that the first line has the phased indel (PGT and PID tag values). Still, with this I would not trust the phased GT values, but just PGT and PID. If you want genome wide phased state you can use Secastel’s PHASER https://github.com/secastel/phaser .

Depending upon the problem my new tool is designed to take PGT and PID values and stitch the highly confident haplotypes to create genomewide haplotype, but this method is designed for F1 hybrids or at least individuals from mixed populations.

https://github.com/everestial/pHASE-Stitcher

If you want to separate phased SNPs from phased InDels you can use pyVCF module or write you own python script.

2 2818 . AAACGGAAC A 5744.45 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.749;ClippingRankSum=0.00;DP=127;ExcessHet=7.0302;FS=7.221;InbreedingCoeff=-0.4144;MQ=52.77;MQRankSum=-6.070e+00;QD=16.41;ReadPosRankSum=1.08;SOR=1.026;set=InDels GT:AD:DP:GQ:PGT:PID:PL 0|1:59,68:127:99:0|1:2818_AAACGGAAC_A:2673,0,2354

2 2828 . T TC 5801.72 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.14;ClippingRankSum=0.00;DP=125;ExcessHet=7.0302;FS=5.790;InbreedingCoeff=-0.4024;MQ=52.64;MQRankSum=-6.039e+00;QD=17.53;ReadPosRankSum=0.497;SOR=0.975;set=InDels GT:AD:DP:GQ:PGT:PID:PL 0|1:58,67:125:99:0|1:2818_AAACGGAAC_A:2702,0,2228

2 2829 . T A 5861.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.642;ClippingRankSum=0.00;DP=125;ExcessHet=7.0302;FS=6.621;InbreedingCoeff=-0.4024;MQ=52.69;MQRankSum=-5.952e+00;QD=17.71;ReadPosRankSum=0.204;SOR=1.009;set=SNPs GT:AD:DP:GQ:PGT:PID:PL 0|1:58,67:125:99:0|1:2818_AAACGGAAC_A:2702,0,2228

2 2841 . T TCC 5036.99 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.249;ClippingRankSum=0.00;DP=112;ExcessHet=5.5175;FS=5.370;InbreedingCoeff=-0.3184;MQ=55.85;MQRankSum=-4.914e+00;QD=18.12;ReadPosRankSum=-1.458e+00;SOR=1.002;set=InDels GT:AD:DP:GQ:PL 0/1:54,58:112:99:2333,0,2129

2 2856 . T C 4030.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.769;ClippingRankSum=0.00;DP=90;ExcessHet=7.0302;FS=11.097;InbreedingCoeff=-0.4024;MQ=53.68;MQRankSum=-4.598e+00;QD=14.98;ReadPosRankSum=-1.760e+00;SOR=1.262;set=SNPs GT:AD:DP:GQ:PGT:PID:PL 0|1:47,43:90:99:0|1:2856_T_C:1684,0,1883

2 2865 . G C 3637.56 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.280e-01;ClippingRankSum=0.00;DP=74;ExcessHet=6.5821;FS=12.430;InbreedingCoeff=-0.3830;MQ=54.91;MQRankSum=-4.671e+00;QD=15.35;ReadPosRankSum=-2.313e+00;SOR=1.350;set=SNPs GT:AD:DP:GQ:PGT:PID:PL 0|1:43,31:74:99:0|1:2856_T_C:1448,0,1731

Sheila Geraldine_VdAuwera @shlee

`Sorry, the IDs are not unique for each haplotype at least for phased state from GVCF method.`

This is why:

In the first line you can see that PID value is `2818_AAACGGAAC_A` for chromosome 2. So, for the same sample if there is any haplotype starting at same position (say for chr 3) it would again have the PID as 2818. So, while mining the haplotype using pyVCF or other method you will have two haplotypes with same ID from two different regions of the genome; well the added allele info (_AAACGGAAC_A) could make the PID more rare, but still there could be same allele config starting at the same position in another chromosome. The best way to handle this would be either using PID = chr+PID, so for chr 2 PID becomes 22828 and for chr 3 PID become 32828; or just use the method proposed by Secastel @scastel .

Finally personal opinion: the phase output from RBP tool isn’t quite nice GT:GQ:HP 0/1:99:17690409-1,17690409-2 GT:GQ:HP 0/1:99:17690409-2,17690409-1:1258.14

You can see it’s not obvious to read the phase state, its hard to mine the phase states (also need to read two lines at the time which is a pain) and it’s hard to directly transfer the phase information to other phasing tool like Beagle, etc. So, staying with the piped method (|) of representing the phase state is still a good way to go.

From Sheila on 2017-05-04

@everestial007

Hi,

I am checking with the team what they think.

-Sheila

From Geraldine_VdAuwera on 2017-05-05

The PIDs are unique if used in combination with the variant location. I understand this may not be optimal considering how you’re parsing the information. However right now we have a lot on our plates and we can’t devote any resources to improving this, sorry.

From everestial007 on 2017-05-05

@Geraldine_VdAuwera : Sorry, I didn’t quite follow on “The PIDs are unique if used in combination with the variant location.” I think PID should be unique at the genome and sample level.

I am parsing my data using pyVCF, as standard module for VCF data. I understand that its overwhelming that you guys receive a lots of issues on daily basis,

but I was only pointing out 1) what are the limitations that exits in ReadBackphasing and GVCF phasing methods, though GVCF is better 2) explanation of the limitations and 3) how it may be improved.

From Geraldine_VdAuwera on 2017-05-06

And I agree with all your points — I’m just stating that we can’t do anything about it right now, because I wouldn’t want you or anyone else to wait in limbo for improvements when we’re not working on it. Sorry if it came out more abruptly than intended.

From wchen on 2018-05-31

Why my phased VCF doesn’t have the ‘HP’ field?

chr9 133730483 . G GCTCTAC 8688.73 . AC=2;AF=1.00;AN=2;BaseQRankSum=2.906;ClippingRankSum=0.241;DP=206;FS=12.876;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-1.166;QD=30.72;ReadPosRankSum=3.987;SOR=0.437 GT:AD:DP:GQ:PL 1/1:7,197:204:99:8726,585,0

Commands used:

java -jar GenomeAnalysisTK-3.3.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I recal.bam —dbsnp dbsnp_138.hg19.vcf -L bed -stand_call_conf 30 -stand_emit_conf 10 -o variants-HC.vcf —validation_strictness LENIENT >& hc3.3.log

java -jar GenomeAnalysisTK-3.3.0/GenomeAnalysisTK.jar -T ReadBackedPhasing -R ucsc.hg19.fasta -L bed -I recal.bam —variant variants-HC.vcf -o phased.vcf —phaseQualityThresh 20 >& phase.log

Tried this: -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000

Got:

chr9 133730273 . C . . . GT:AD:DP:GQ:PL 0/0:354,0:354:99:0,120,1800

Thanks!

From shlee on 2018-05-31

Hi @wchen,

Can you try GATK v3.7 or v3.8 and see if the HP tag is still missing?

From everestial007 on 2018-09-22

I have been trying to know when did ReadBackPhasing start and who started it. Need to cite the first person who did it, but can’t seem to find it (even via google scholar). I am think people at BroadInst might know.

From ebanks on 2018-09-23

The main developer of this tool was Menachem Fromer, and he did this back in 2010/2011.

From samgao on 2018-11-04

Hi,

Could you help clarify how the PQ score is calculated? I’ve been looking at a ton of articles on the forum but haven’t had any luck. Thanks so much in advance!

From samgao on 2018-11-04

Hi,

Was ReadBackedPhasing removed from GATK4? I see ReadBackedPhasing.java in GATK3 ▸ src⁩ ▸ ⁨main⁩ ▸ ⁨java⁩ ▸ ⁨org⁩ ▸ ⁨broadinstitute⁩ ▸ ⁨gatk⁩ ▸ ⁨tools⁩ ▸ ⁨walkers⁩ ▸ ⁨phasing⁩, but not in the GATK4 source code folder.

From sprocha on 2018-11-13

@samgao

that seems to be implicit here “Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in -ERC GVCF or -ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary“

but I also would like to know exactly how to recover phasing info from the jointGeno vcf’s.

any info appreciated

Report abuse