2020-03-03

Post date: Mar 03, 2020 11:7:24 PM

Ouch! 20 days I have not looked into this, it is hard to dig back into it... I will try for the month to come to focus 2 hours per day on it, as I focus 2 hours per day on a paper.

Email answer from Zach:

################################################################################################

1. Heterozygosity.

- This looks reasonable. Perhaps a slight excess of observed het., especially in the tails. Just what you would expect for somatic mutations.

- For triploids expected heterozyogsity is just 1-(p^3 + q^3), that is one minus the probability of homozygosity.

2. PCA

- Agreed, there doesn't appear to be any clustering by PC scores and the first few PCs don't dominate things. This would be expected if any given somatic mutation is mostly found in a few individuals.

- No, you can't put 2 in for missing data, that messes up the PCA. If anything, maybe give them a 0. Or, better yet, just use the probability of het. as your variable. This gets around the problem (this ranges from 0 to 1).

- Once you try that, put it on a map again, and compare big and small trees. Maybe make a basic plot of genetic distance (from somatic mutations) by geographic distance too. And yes, try k-means or hierarhical clustering.

################################################################################################

Paths I use all the time:

scp -r /Volumes/Data/Doc_Rozenn/Education/GaTech/Year2/Spring2020/Research/data/Pando/findSNPs/Phred_score_to_proba.py u6028866@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/scripts/convert_score_to_proba.py

cd data/Pando/variants/pando_only_variants/filter_hets_50

module load python/3.6.3

chmod 755 convert_score_to_proba.py

################################################################################################

What I can do is read the vcf file and only keep the Phred score for heterozygosity - then convert this score into a proba that is between 0 and 1.

I wrote this code to extract the Phred score from the vcf file, convert it to probability and only keep the proba of being hets:

################################################################################################

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

"""

Created on Tue Mar 3 22:32:37 2020

This script reads in the vcf file and returns a matrix of genotypes.

Revert the genotype likelihoods from Phred scale to probabilities.

Proba are between 0 and 1.

Sum all three probabilities to get the sum (which will be more than one) and normalize the obtained proba by this number.

"""

# Modules needed

import re as regex

# Define the function

def revert_score(input_file, output_file):

# Open the file, read line by line

final = open(output_file,"w+")

with open(input_file) as file:

# Go through the file line by line

for line in file:

# if the line start with a scaffolfd name, in this case "Potrs" for Pop. tremuloides

if not line.startswith('Potrs'):

pass

print("pass")

else:

# Match the genotype likelihood (gl) in the vcf file

gl = regex.findall(r':\d+,\d+,\d+:', line)

# Loop over the gl

pre_final = []

for ind in gl:

ind = ind.split(',')

temp = []

for i in ind:

i = regex.sub(':','',i)

i = int(i)

# Revert the Phred score to proba

proba = 10**(i/(-10))

temp.append(proba)

sum_proba = sum(temp)

# This happens for each SNPs for each individual

for j in temp:

pre_final.append(j/sum_proba)

# This happens for each SNP

for item in pre_final[1::3]:

final.write(f'{item}\t')

final.write('\n')

# Call the function

revert_score("pando_50_stringent_filter.vcf","20200303_test.txt")

################################################################################################

I run this script on the cluster, on the filtered file (7715 SNPs)

I obtain a text file of 7715 x109 which is coherent (7715 SNPs, 109 individuals)

I plot the PCA basic plots here.

The problem is that Big and Small trees have the same coordinates.

I would like to know if they are really different from each other - how could I do that?

a similarity matrix?

What genetic distance can I use?