Post date: Dec 04, 2019 12:6:37 AM
Working on step 2 described below, I used the following code to generate a matrix that labels 1 the heterozygotes, 0 the homozygotes and "NA" when we are not sure.
I tried several thresholds: 0.01, 0.02, 0,05 for heteroZ did not give any heteroZ.
Looking at the file of probabilities, it looks like all of them are: very small number .5 .5
Did I do something wrong in the calculations?
Lines where there is only 0s (I thought I had gotten rid of them)
Update: I was not looking at the right number for the genotype likelihoods in the vcf file. The below code is corrected and gives better results now.
######################################################################################
Python code
#####################################################################################
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
This script reads in the vcf file and returns a matrix of genotypes.
Revert the genotype likelihoods from Phred scale to probabilities.
Proba are between 0 and 1.
Sum all three probabilities to get the sum (which will be more than one) and normalize the obtained proba by this number.
Then apply the filters:
If the proba of heteroZ is >0.99 then call it an heterozygote and code it with a 1.
If the proba of heteroZ is <0.01 then call it an homozygote and assign a 0.
If the proba of heteroZ is between 0.01 and 0.99, assign "NA".
Then calculate proportion of heteroZ for each allele and make the graph of number of number of heteroZ kept as a function of the cut.
"""
# Part 1 - transform Phred scores into probabilities
# Modules needed
import re as regex
import argparse
# import math # to calculate log functions
# Arguments to be read
parser=argparse.ArgumentParser()
parser.add_argument("-i","--inputfile", required=True, help="Input vcf file.")
parser.add_argument("-o1","--outputfile1", required=True, help="Output file name 1, probas.")
parser.add_argument("-o2","--outputfile2", required=True, help="Output file name 2, heteroZ labels.")
# Read arguments
args = parser.parse_args() # Store arguments in parse
# Assign argument to variable name
input_file = args.inputfile
output_file1 = args.outputfile1
output_file2 = args.outputfile2
# Open the file, read line by line
final = open(output_file1,"w+")
with open(input_file) as file:
# Go through the file line by line
for line in file:
# if the line start with a scaffolfd name, in this case "Potrs" for Pop. tremuloides
if not regex.match(pattern = '^Potrs', string = line):
pass
else:
# Match the genotype likelihood (gl) in the vcf file
gl = regex.findall(r'\d+,\d+,\d+', line)
print(gl)
# Loop over the gl
pre_final = []
for ind in gl:
ind = ind.split(',')
temp = []
for i in ind:
i = int(i)
# Revert the Phred score to proba
proba = 10**(i/(-10))
temp.append(proba)
sum_proba = sum(temp)
for j in temp:
pre_final.append(j/sum_proba)
for item in pre_final:
final.write(f'{item}\t')
final.write('\n')
# Part 2 - label heterozygotes
'''
Open the created output file.
Walk through the file, start at 1 and then fo three by three to only look at heteroZ values.
If value > 0.99 --> 1 for heteroZ
If value < 0.01 --> 0 for homoZ
If 0.001 > value > 0.99 --> "NA"
'''
heteroZ = open(output_file2,"w+")
with open(output_file1) as file:
# Go through the file line by line
for line in file:
temp = []
line = line.rstrip() # Get rid of the new line character
line = list(line.split("\t")) # split the line into a list
proba_het = line[1::3] # Only keep one item every three items (heterozygote probability)
for proba in proba_het:
proba = float(proba) # from string to float
if proba > 0.99:
# The individual is heterozygote -- coded with 1
temp.append(1)
if proba < 0.01:
# The individual is homozygote -- coded with 0
temp.append(0)
# We don't know
else:
temp.append("NA") # -- coded with NA
for item in temp:
heteroZ.write(f'{item}\t')
heteroZ.write('\n')