Labeling heterozygotes

Post date: Dec 04, 2019 12:6:37 AM

Working on step 2 described below, I used the following code to generate a matrix that labels 1 the heterozygotes, 0 the homozygotes and "NA" when we are not sure.

I tried several thresholds: 0.01, 0.02, 0,05 for heteroZ did not give any heteroZ.

Looking at the file of probabilities, it looks like all of them are: very small number .5 .5

Did I do something wrong in the calculations?

Lines where there is only 0s (I thought I had gotten rid of them)

Update: I was not looking at the right number for the genotype likelihoods in the vcf file. The below code is corrected and gives better results now.

######################################################################################

Python code

#####################################################################################

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

"""

This script reads in the vcf file and returns a matrix of genotypes.

Revert the genotype likelihoods from Phred scale to probabilities.

Proba are between 0 and 1.

Sum all three probabilities to get the sum (which will be more than one) and normalize the obtained proba by this number.

Then apply the filters:

If the proba of heteroZ is >0.99 then call it an heterozygote and code it with a 1.

If the proba of heteroZ is <0.01 then call it an homozygote and assign a 0.

If the proba of heteroZ is between 0.01 and 0.99, assign "NA".

Then calculate proportion of heteroZ for each allele and make the graph of number of number of heteroZ kept as a function of the cut.

"""

# Part 1 - transform Phred scores into probabilities

# Modules needed

import re as regex

import argparse

# import math # to calculate log functions

# Arguments to be read

parser=argparse.ArgumentParser()

parser.add_argument("-i","--inputfile", required=True, help="Input vcf file.")

parser.add_argument("-o1","--outputfile1", required=True, help="Output file name 1, probas.")

parser.add_argument("-o2","--outputfile2", required=True, help="Output file name 2, heteroZ labels.")

# Read arguments

args = parser.parse_args() # Store arguments in parse

# Assign argument to variable name

input_file = args.inputfile

output_file1 = args.outputfile1

output_file2 = args.outputfile2

# Open the file, read line by line

final = open(output_file1,"w+")

with open(input_file) as file:

# Go through the file line by line

for line in file:

# if the line start with a scaffolfd name, in this case "Potrs" for Pop. tremuloides

if not regex.match(pattern = '^Potrs', string = line):

pass

else:

# Match the genotype likelihood (gl) in the vcf file

gl = regex.findall(r'\d+,\d+,\d+', line)

print(gl)

# Loop over the gl

pre_final = []

for ind in gl:

ind = ind.split(',')

temp = []

for i in ind:

i = int(i)

# Revert the Phred score to proba

proba = 10**(i/(-10))

temp.append(proba)

sum_proba = sum(temp)

for j in temp:

pre_final.append(j/sum_proba)

for item in pre_final:

final.write(f'{item}\t')

final.write('\n')

# Part 2 - label heterozygotes

'''

Open the created output file.

Walk through the file, start at 1 and then fo three by three to only look at heteroZ values.

If value > 0.99 --> 1 for heteroZ

If value < 0.01 --> 0 for homoZ

If 0.001 > value > 0.99 --> "NA"

'''

heteroZ = open(output_file2,"w+")

with open(output_file1) as file:

# Go through the file line by line

for line in file:

temp = []

line = line.rstrip() # Get rid of the new line character

line = list(line.split("\t")) # split the line into a list

proba_het = line[1::3] # Only keep one item every three items (heterozygote probability)

for proba in proba_het:

proba = float(proba) # from string to float

if proba > 0.99:

# The individual is heterozygote -- coded with 1

temp.append(1)

if proba < 0.01:

# The individual is homozygote -- coded with 0

temp.append(0)

# We don't know

else:

temp.append("NA") # -- coded with NA

for item in temp:

heteroZ.write(f'{item}\t')

heteroZ.write('\n')