Post date: Dec 10, 2019 2:55:43 AM
Next step - Identify somatic mutations in Pando, and compare in the other clones, following these steps:
Create a list (dictionary) of all of the the putative somatic mutations (the stuff you keep after filtering for high heterozygosity in pando). Record the scaffold and position for each. Then go through the vcf files for the panel of normals and the pando friends. For each SNP in those files, see if it matches (scaffold and position) a SNP in the somatic mutation set. Then, from that, print out the number/propotion of putative somatic mutations with a match to each of the two other data sets. If you want to get a bit fancier, you could test for a match in terms of scaffold, position, and non-reference allele.
# Understanding what the next step is.
1 - I filtered the vcf files.
2 - I reverted thr Phrd score to proba.
3 - I looked at the heterozygote proba, if it is above threshold, I call it true hets. If it is below 0.01, I call it true homs.
4 - I plot the proportion of hets as compared to hets + homs, leaving aside the NAs for every SNPs.
5 - From this plot, we see that whatever threshold we use, there is a cut at 0.8. There is a decent number of SNPs with >80% of hets. These hets are the "true" hets, shared in the whole clone and can be considered as somatic mutations. Do we consider then that the background is homozygous? Do we consider it is normal to be homozygous? And it is also normal to be heteroZ if the proportion of hets is above 80 in one SNP?
6 - So what I need to do now is to take the SNPs that have more than 80% hets.
To do:
- push 20191209_filter_common_variants and try it on the cluster (function)
Code that calls the heteroZ label file and counts proportions according to thesholds:
#######################################################################################
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Calls in the function that revert Phred scores to probabilities and assign genotypes (homoZ, heteroZ)
Calculates the proportion of hets.
Boolean vector with 1s for the lines to keep that have a proportion of hets inferior to the chosen theshold.
Read the original vcf file and create a dictionary with the scaffold name, mutation start and stop, and nucleotide change
"""
def main(t, output_vec, T, n):
# Choose the threshold to separate true homozygotes and true hets from the rest
#n # number of individuals expected in the file
t = str(t) # file name has the threshold as a string.
# read in the file
filename = "heteroZ_label_{}.txt".format(t)
final = open(output_vec, "w+")
file = open(filename, "r")
for line in file:
line = line.strip()
store = line.split()
# Control here for the good length
if len(store) == n:
hets = 0 # Records the number of hets
homs = 0 # Records the number of homs
NA = 0 # Record non-assigned genotypes
for item in store:
if item == "1":
hets += 1
elif item == "0":
homs += 1
elif item == "2":
NA += 1
if NA != n:
prop = hets/(hets+homs)
if prop >= T: # exclusive: everything bigger or equal to T is not kept
final.write("0\n") # We do not keep true hets, they receive a 0
elif prop < T:
final.write("1\n") # We keep every hets that has a proportion less than 80% per SNP
else:
print("Proportion does not make sense.")
print(hets,homs)
break
elif NA == n:
print("Only NA's in this line.")
final.write("0\n")
else:
print("The line has too many individuals.")
print(len(line))
break
if __name__ == "__main__":
main(0.06, "cheesecake.txt", .8, 109)
###################################################################################
Observations:
10100 SNPs after filtering for true hets with threshold 0.8.
10100 SNPs in the dictionary (it means there is no replicates and they are all different)
Code 1 -
creates a vector of 1s and 0s for the lines to keep
Code 2 -
filters the vcf file sccoridng to vector from code 1 and outputs a dictionary
How to take into account the non-reference allele?