2020 - 01 - 13

Post date: Jan 13, 2020 10:23:6 AM

Tasks done today

- Read "What is eating the Pando clone" 2019, learned a lot of cool information about the forest and the grazers.

- Python script (below) to filter the SNPs from the vcf file that we want to keep for the analysis.

To launch the code creating and comparing dictionnaries (you need to have the "res" folder where you run the code)

/uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/filter_common_hets.py

Numbers to keep in mind:

The Pando size is about 43 hectares, which is 43 soccer fields, or 43*48=2064 double-Decker buses parked together with no space.

What proportion of hets to keep, 50 or 80? find more numbers that can help me take a decision. We keep every SNPs that has less than 80% or 50% hets per snp.

11196 potential interesting SNPs when using 80% hets as a filter

9754 potential interesting SNPs when using 50% hets as a filter Number of unique SNPs

Friends: 33499

PON: 84395

There is 13670 SNPs in common between pon et friends (the vcf file was not filtered for low heterozygosity only, which explain the hogh number of candidates)

Python script to filter the SNPs from the vcf file that we want to keep for the analysis.

"""

This scripts extracts the lines from the vcf file that contain the SNPs we retain

after filtering for low heterozygosity.

The information is contained in the dictionnaries.

"""

# Function that recreates the vcf file from the dictionary

def filter_vcf_from_dict(input_dict, input_vcf, output_vcf):

print("Reading dictionary and vcf file..")

# Open the vcl file and keep the lines we are interested in from the vector created above

# Input dictionary

my_dict = open(input_dict, "r")

# Output vcf file

out_vcf = open(output_vcf, "w+")

# LOOP 1 - Open dictionary and store chromosome and position information we are looking for in the vcf file

####

for line in my_dict:

line = line.split()

chrom = line[0]

pos = line[1]

# LOOP 2 - Open vcf file, store info and compare

####

in_vcf = open(input_vcf, "r")

for vcf_line in in_vcf:

if not vcf_line.startswith('Potrs'):

pass

elif not vcf_line.startswith(chrom):

pass

else:

pos_vcf = vcf_line.split()[1]

if pos_vcf != pos:

pass

else:

out_vcf.write(vcf_line)

break

# Call the function here

filter_vcf_from_dict("pando_50.dict", "filtered2xHiCov_pando_only_variants.vcf", "pando_lowhets_50.vcf")

Now I need to calculate AC/AN to have an idea of the allele frequency of the alleles kept: control - what allele fre do we find

Pbm: il y a des AC/AN qui sont proches de 1. Pourquoi?