Post date: Mar 03, 2020 11:7:24 PM
Ouch! 20 days I have not looked into this, it is hard to dig back into it... I will try for the month to come to focus 2 hours per day on it, as I focus 2 hours per day on a paper.
Email answer from Zach:
################################################################################################
1. Heterozygosity.
- This looks reasonable. Perhaps a slight excess of observed het., especially in the tails. Just what you would expect for somatic mutations.
- For triploids expected heterozyogsity is just 1-(p^3 + q^3), that is one minus the probability of homozygosity.
2. PCA
- Agreed, there doesn't appear to be any clustering by PC scores and the first few PCs don't dominate things. This would be expected if any given somatic mutation is mostly found in a few individuals.
- No, you can't put 2 in for missing data, that messes up the PCA. If anything, maybe give them a 0. Or, better yet, just use the probability of het. as your variable. This gets around the problem (this ranges from 0 to 1).
- Once you try that, put it on a map again, and compare big and small trees. Maybe make a basic plot of genetic distance (from somatic mutations) by geographic distance too. And yes, try k-means or hierarhical clustering.
################################################################################################
Paths I use all the time:
scp -r /Volumes/Data/Doc_Rozenn/Education/GaTech/Year2/Spring2020/Research/data/Pando/findSNPs/Phred_score_to_proba.py u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/scripts/convert_score_to_proba.py
cd data/Pando/variants/pando_only_variants/filter_hets_50
module load python/3.6.3
chmod 755 convert_score_to_proba.py
################################################################################################
What I can do is read the vcf file and only keep the Phred score for heterozygosity - then convert this score into a proba that is between 0 and 1.
I wrote this code to extract the Phred score from the vcf file, convert it to probability and only keep the proba of being hets:
################################################################################################
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 3 22:32:37 2020
This script reads in the vcf file and returns a matrix of genotypes.
Revert the genotype likelihoods from Phred scale to probabilities.
Proba are between 0 and 1.
Sum all three probabilities to get the sum (which will be more than one) and normalize the obtained proba by this number.
"""
# Modules needed
import re as regex
# Define the function
def revert_score(input_file, output_file):
# Open the file, read line by line
final = open(output_file,"w+")
with open(input_file) as file:
# Go through the file line by line
for line in file:
# if the line start with a scaffolfd name, in this case "Potrs" for Pop. tremuloides
if not line.startswith('Potrs'):
pass
print("pass")
else:
# Match the genotype likelihood (gl) in the vcf file
gl = regex.findall(r':\d+,\d+,\d+:', line)
# Loop over the gl
pre_final = []
for ind in gl:
ind = ind.split(',')
temp = []
for i in ind:
i = regex.sub(':','',i)
i = int(i)
# Revert the Phred score to proba
proba = 10**(i/(-10))
temp.append(proba)
sum_proba = sum(temp)
# This happens for each SNPs for each individual
for j in temp:
pre_final.append(j/sum_proba)
# This happens for each SNP
for item in pre_final[1::3]:
final.write(f'{item}\t')
final.write('\n')
# Call the function
revert_score("pando_50_stringent_filter.vcf","20200303_test.txt")
################################################################################################
I run this script on the cluster, on the filtered file (7715 SNPs)
I obtain a text file of 7715 x109 which is coherent (7715 SNPs, 109 individuals)
I plot the PCA basic plots here.
The problem is that Big and Small trees have the same coordinates.
I would like to know if they are really different from each other - how could I do that?
a similarity matrix?
What genetic distance can I use?