Post date: Jan 16, 2020 10:43:35 AM
Tasks done today:
- read DeRose et al. 2016: Cytotypes differences in radial increment provide novel insight on aspen reproductive ecology
cool information about diploidy and triploidy differences within the Pando clone. Hypothesize the stand started to form only 200 years ago?
- Found 421 shared SNPs in all stands
- I now have the allele frequency of the mutations I choose. This is p, the observed allele frequency. If the population was at the HW equilibrium, I would expect to see 2pq heterozygotes. I can calculate that based on p, but also get the observed number from the vcf file. (let's go)
I estimate the number of heterozygotes per SNP using the vcf file estimation: 1/1, 1/0, ./., ect ...
I wrote the following script: ./count_hets.py and store the results in two different columns in the folder res: one column is the number of hets per SNP, the other column is the total number of individual with data.
##################################################################
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 16 06:22:41 2020
This script reads in a vcf file and counts the number of hets er SNPs.
It uses the prediction given by the file, and not the reverted Phred score as I have done before.
This is a quicker way that doesn't use a threshold to decide for heterozygosity (is it the way to go?)
"""
import re as regex
def count_hets(vcf_file, hets_out, tot_out):
hets_final = open(hets_out, "w+")
tot_final = open(tot_out, "w+")
with open(vcf_file) as file:
# Go through the file line by line
for line in file:
# if the line start with a scaffolfd name, in this case "Potrs" for Pop. tremuloides
if not line.startswith('Potrs'):
pass
else:
# Match the genotype likelihood (gl) in the vcf file
gl = regex.findall(r'[01.]/[01.]', line)
# Loop over the gl
hets = 0
tot = len(gl)
for ind in gl:
ind = ind.split('/')
if ind[0] == "." or ind[1] == ".":
tot = tot-1
pass
# Are both alleles the same?
if ind[0] != ind[1]:
hets += 1
hets_final.write(f'{hets}\n')
tot_final.write(f'{tot}\n')
# Call function here
count_hets("pando_lowhets_50.vcf", "./res/pando_lowhets_50_hets.txt", "./res/pando_lowhets_50_tot.txt")
Questions:
- now I have approx 10000 SNPs as potential interesting mutations.
Why did we look for them in the other stands (friends and pon)? Is it to delete them because they are more common?
How many SNPS are common between the three?
I write the script called compare_dict.py and I find 421 SNPs common between pando, friends and the panel of normals. (common_SNPs_pando_pon_friends.dict) - I also found a very quick way to do the same thing with the following command:
comm -12 <(sort common_pando_friends_50.dict) <(sort common_pando_pon_50.dict) > same.test