Getting familiar with .vcf files

Post date: Nov 06, 2019 12:4:46 AM

This morning I tried to get the number of reads per SNP per individual, which Zach solved with the following command:

grep ^Pot filtered2xHiCov_pando_variants.vcf | perl -pe 's/^.+AD\s+//' | perl -pe 's/\S+:(\d+):\d+,\d+/$1/g' > depthMatrix.txt

which extracts a matrix of depth/coverage per individual and SNP from the filtered vcf file.
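
For my own understanding, here is a rough Python sketch of what I think the pipeline does. It is not Zach's method. It assumes that the fields are tab-separated as in a standard VCF, that the scaffold names start with "Pot" (as in the grep), and that the FORMAT column contains a DP entry (which the regular expression above suggests). The function name and the example record are made up for illustration.

```python
def depth_matrix(vcf_lines, chrom_prefix="Pot"):
    """Return one row per SNP holding the DP value of each individual."""
    rows = []
    for line in vcf_lines:
        if not line.startswith(chrom_prefix):
            continue  # skips header lines and any other records
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")  # FORMAT column, e.g. GT:PL:DP:AD
        dp_index = fmt.index("DP")
        row = []
        for sample in fields[9:]:
            parts = sample.split(":")
            # Missing or truncated sample fields are counted as depth 0
            value = parts[dp_index] if dp_index < len(parts) else "."
            row.append(int(value) if value.isdigit() else 0)
        rows.append(row)
    return rows


# Invented two-individual record, only to show the expected layout
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n",
    "Potrs000001\t101\t.\tA\tG\t50\t.\t.\tGT:PL:DP:AD\t0/1:30,0,40:7:4,3\t0/0:0,9,60:3:3,0\n",
]
print(depth_matrix(example))
```

With the real file, something like `depth_matrix(open("filtered2xHiCov_pando_variants.vcf"))` should give the same numbers as depthMatrix.txt, if the assumptions above hold.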

The next step is to visualize the data with ordination methods such as PCA, to make sure that we see what we expect.

I should try several read-depth thresholds and compare the results to choose which one to keep.
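
A minimal sketch of how the thresholds could be compared, assuming that a threshold means "mean depth per SNP across individuals" (other definitions, such as a minimum depth per individual, are just as plausible). The function name and the toy matrix are hypothetical; the real input would be the rows of depthMatrix.txt.

```python
def count_snps_kept(depth_rows, min_mean_depth):
    """Count SNPs (rows) whose mean depth across individuals meets the threshold."""
    kept = 0
    for row in depth_rows:
        if sum(row) / len(row) >= min_mean_depth:
            kept += 1
    return kept


# Toy matrix: 3 SNPs (rows) x 3 individuals (columns), invented values
toy_depth = [
    [7, 3, 5],
    [1, 0, 2],
    [12, 9, 15],
]

for threshold in (2, 4, 8):
    print(threshold, count_snps_kept(toy_depth, threshold))
```

Counting how many SNPs survive each threshold is only the first half of the comparison; the PCA would then be rerun on each filtered set.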

We expect:

- germline mutation differences between Pando and the two nearby clones, which should separate them on a PCA

- somatic mutations (when keeping only rare variants within Pando; here again, try several thresholds from 1% to 5%) should also spread Pando ramets out on a PCA

- it would also be interesting to build distance matrices comparing spatial distance to genetic distance for Pando only (correlation?)
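
For the last point, a possible starting sketch: compute all pairwise spatial distances and all pairwise genetic distances, then correlate the two lists. This correlation is the statistic used in a Mantel test, but a significance test would need permutations, which are not shown. Using Euclidean distance on genotype estimates as the genetic distance is my assumption, and the coordinates and genotypes below are invented.

```python
from itertools import combinations
from math import dist, sqrt


def pairwise_distances(points):
    """Euclidean distance for every pair of points, in a fixed pair order."""
    return [dist(a, b) for a, b in combinations(points, 2)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical ramet positions (x, y) and genotype estimates (one value per SNP)
coords = [(0.0, 0.0), (10.0, 0.0), (0.0, 20.0), (30.0, 30.0)]
genotypes = [(0.0, 1.0, 2.0), (0.1, 1.0, 1.9), (0.3, 1.2, 1.6), (1.0, 0.2, 0.9)]

r = pearson(pairwise_distances(coords), pairwise_distances(genotypes))
print(round(r, 3))
```

Both lists come from `combinations` over the same individual order, so each spatial distance is paired with the genetic distance for the same two ramets.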

To do so, I can use the following files and information:

filtered2xHiCov_pando_variants.vcf = "final" filtered vcf file
filtered2xHiCov_pando_variants.gl = created from the vcf file, contains two header rows, and then one row per SNP. Each row gives the three genotype likelihoods for each individual. In other words, this is a genotype likelihood matrix.
pntest_filtered2xHiCov_pando_variants.txt = posterior mean estimates of genotypes. These were obtained from the genotype likelihoods and allele frequencies which serve as a prior. Here you have one row per SNP and one column per individual. The numbers are estimates of the number of reference alleles, and are between 0 and 2, but not constrained to be integer values.
mle_p_pando_plus.txt = MLE allele frequencies. One row per SNP. These estimates account for uncertainty in genotype and follow from the math in the Li paper.
p_pando_plus.txt = raw output from MLE allele frequency estimation. The file above is the 3rd column from this file. This file also contains the SNP id and a naive allele frequency estimate.
depthMatrix.txt = matrix of the number of reads/coverage per individual (column) and SNP (row). This was obtained from the command at the top of this e-mail.
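
A sketch of how these files might feed the PCA, assuming NumPy is available. It takes a SNP-by-individual matrix laid out like the posterior mean genotype file, centers each SNP, and gets the PC scores from an SVD. The rare-variant mask treats "rare" as a minor allele frequency at or below the threshold, which is my assumption. The function names and toy numbers are hypothetical.

```python
import numpy as np


def rare_variant_mask(allele_freqs, max_maf):
    """True for SNPs whose minor allele frequency is at or below max_maf."""
    p = np.asarray(allele_freqs, dtype=float)
    return np.minimum(p, 1.0 - p) <= max_maf


def pca_scores(genotypes_snp_by_ind, n_components=2):
    """PCA of individuals from a SNP (rows) x individual (columns) matrix."""
    g = np.asarray(genotypes_snp_by_ind, dtype=float).T  # individuals x SNPs
    g = g - g.mean(axis=0)  # center each SNP
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)  # proportion of variance per component
    return scores, explained[:n_components]


# Toy posterior mean genotypes: 3 SNPs x 4 individuals, invented values
toy = np.array([
    [0.0, 0.1, 1.9, 2.0],
    [1.0, 1.1, 0.2, 0.0],
    [2.0, 1.8, 1.0, 1.1],
])
toy_freqs = [0.01, 0.5, 0.97]  # invented allele frequencies, one per SNP

scores, explained = pca_scores(toy, n_components=2)
print(scores.shape, rare_variant_mask(toy_freqs, 0.05))
```

With the real data, `np.loadtxt("pntest_filtered2xHiCov_pando_variants.txt")` may load the genotype matrix directly if the file is plain whitespace-delimited numbers, which I have not checked.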
