2020-04-23

Post date: Apr 23, 2020 6:55:21 PM

All analyses are based on these files:

vcf Pando: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/vcf/filtered2xHiCov_pando_only_variants.vcf

vcf friends: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_friends_variants/filtered2xHiCov_pando_friends_variants.vcf

vcf PON: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pon_variants/filtered2xHiCov_pon_variants.vcf

I have now checked every script and rewritten some of them to make them simpler and clearer. I outline the new pipeline here.

1 - extract the Phred-scaled likelihood from the vcf file, and convert it back to a likelihood.

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/1-find_proba_hets/extract_proba_hets.py

SCRIPT NAME: extract_proba_hets.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: one txt file with one SNP per line and one probability per column (one column per individual).
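
A minimal sketch of the conversion in step 1, not the actual contents of extract_proba_hets.py. A Phred-scaled likelihood PL maps back to a likelihood as 10^(-PL/10). The sketch assumes that each sample column carries a three-value PL field ordered hom-ref, het, hom-alt, and that the het probability is obtained by normalising the three likelihoods. Both points, and the function names, are assumptions made for illustration.

```python
import math


def pl_to_het_probability(pl_field):
    """Convert a PL string such as '10,0,10' to a normalised het probability.

    Assumes a biallelic SNP with PL ordered hom-ref, het, hom-alt.
    """
    phred = [float(x) for x in pl_field.split(",")]
    # Undo the Phred scaling: PL = -10 * log10(likelihood)
    likelihoods = [10 ** (-p / 10.0) for p in phred]
    return likelihoods[1] / sum(likelihoods)


def het_probabilities_for_line(vcf_line):
    """Return one het probability per individual for a single VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    pl_index = fields[8].split(":").index("PL")  # position of PL in FORMAT
    probs = []
    for sample in fields[9:]:
        values = sample.split(":")
        # Missing calls such as './.' or a '.' PL value become NaN;
        # other missing-data encodings are not handled in this sketch.
        if pl_index >= len(values) or values[pl_index] in (".", ""):
            probs.append(math.nan)
        else:
            probs.append(pl_to_het_probability(values[pl_index]))
    return probs
```

Writing the returned list as one tab-separated row per SNP would give the layout described in OUTPUT above.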

2 - label "true" heterozygotes based on a changing threshold (script to find which threshold to use)

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/1-find_proba_hets/label_true_hets.py

SCRIPT NAME: label_true_hets.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: as many txt files as there are thresholds tried.

--> Use 0.94 as a threshold.
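
A minimal sketch of the labelling in step 2, not the actual contents of label_true_hets.py. It assumes that a genotype counts as a "true" het when its het probability is greater than or equal to the threshold (whether the real script uses >= or > is not stated here), and the function names are hypothetical.

```python
def label_true_hets(het_probs, threshold=0.94):
    """Return 1 where the het probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in het_probs]


def label_for_thresholds(rows, thresholds):
    """Label every SNP row once per threshold.

    rows: list of per-SNP lists of het probabilities.
    Returns a dict mapping each threshold to its labelled rows, which
    mirrors "one output file per threshold tried".
    """
    return {t: [label_true_hets(row, t) for row in rows] for t in thresholds}
```

Note that a NaN probability compares as False and is therefore labelled 0 in this sketch.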

3 - filter vcf based on the proportion of heterozygotes per SNP

3 - a) create a Boolean vector (0/1) indicating whether each SNP's proportion of hets is less than 50%, or less than 80%

3 - b) remove header from vcf file

3 - c) filter vcf file based on boolean vector

3 - a) LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/1-find_proba_hets/Find_threshold_plots.R

SCRIPT NAME: Find_threshold_plots.R

OUTPUT: two boolean vectors (one for the 50% cutoff, one for the 80% cutoff).

3 - b) LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/2-filter_vcf_for_low_hets/remove_vcf_header.py

SCRIPT NAME: remove_vcf_header.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: vcf file without the header.

3 - c) LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/2-filter_vcf_for_low_hets/filter_vcf_based_on_bool.py

SCRIPT NAME: filter_vcf_from_bool.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: filtered vcf file based on boolean vector.
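
The three parts of step 3 can be sketched together as follows. This is an illustration in Python only (part a is done in R in the pipeline), and it assumes the 0/1 het labels from step 2 as input, a strict "less than" comparison, and that the boolean vector is in the same order as the header-free vcf lines. All function names are hypothetical.

```python
def low_het_mask(label_rows, max_proportion):
    """Return 1 for SNPs whose proportion of hets is below max_proportion."""
    mask = []
    for row in label_rows:
        proportion = sum(row) / len(row)
        mask.append(1 if proportion < max_proportion else 0)
    return mask


def remove_header(vcf_lines):
    """Drop every header line (lines starting with '#')."""
    return [line for line in vcf_lines if not line.startswith("#")]


def filter_by_mask(data_lines, mask):
    """Keep the data lines whose matching mask entry is 1."""
    if len(data_lines) != len(mask):
        raise ValueError("mask and vcf body must have the same length")
    return [line for line, keep in zip(data_lines, mask) if keep]
```

The length check is there because the mask is only meaningful once the header has been removed, which is presumably why step 3 - b) comes before step 3 - c).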

4 - Create dictionary for each vcf file: low hets Pando vcf file, PON vcf file and friends vcf file.

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/3-filter_vcf_for_germline_SNPs/1-create_dict/create_dict.py

SCRIPT NAME: create_dict.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: dictionary with the same number of entries as the vcf file, i.e. there are no duplicate SNPs.

9745 pando_only_variants_50_hets.dict

11196 pando_only_variants_80_hets.dict

33501 pando_friends.dict

84397 pon.dict
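
A minimal sketch of step 4, not the actual contents of create_dict.py. It assumes that each SNP is keyed by its chromosome (or scaffold) name and position; the key actually used by the script is not recorded here. Because a dictionary cannot hold duplicate keys, a dictionary as long as the vcf body means every SNP key is unique.

```python
def create_dict(vcf_lines):
    """Map (chromosome, position) to the full VCF data line."""
    snps = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines if any are still present
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]))  # assumed key: CHROM and POS
        snps[key] = line
    return snps


def has_duplicates(vcf_lines):
    """True if two data lines share the same key."""
    n_data = sum(1 for line in vcf_lines if not line.startswith("#"))
    return len(create_dict(vcf_lines)) != n_data
```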

5 - Compare dictionaries and separate similarities from differences.

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/3-filter_vcf_for_germline_SNPs/2-compare_dict/compare_dict.py

SCRIPT NAME: compare_dict.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: dictionary that contains the intersection of the two input dictionaries.

1003 intersection_SNPs_pando_friends_50.dict

1215 intersection_SNPs_pando_friends_80.dict

1448 intersection_SNPs_pando_pon_50.dict

1662 intersection_SNPs_pando_pon_80.dict

421 intersection_all_50.dict

507 intersection_all_80.dict
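
A minimal sketch of the comparison in step 5, not the actual contents of compare_dict.py. It assumes two dictionaries keyed by SNP, and that the three-way "intersection_all" is obtained by intersecting the result with a third dictionary. The function name is hypothetical.

```python
def intersect_dicts(first, second):
    """Return the entries of `first` whose key is also present in `second`.

    Values are taken from `first`, and its insertion order is preserved.
    """
    return {key: value for key, value in first.items() if key in second}
```

For example, chaining the call as `intersect_dicts(intersect_dicts(pando, friends), pon)` would give the SNPs shared by all three groups, where `pando`, `friends` and `pon` stand for the dictionaries from step 4.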

6 - filter the vcf file using the dict created in 5). The goal is to obtain the vcf file of the intersection between groups, so that it can later be compared to the original file to keep only the difference.

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/3-filter_vcf_for_germline_SNPs/3-filter_vcf_from_dict/filter_vcf_from_dict.py

SCRIPT NAME: filter_vcf_from_dict.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: vcf file that contains the intersection between two vcf files, using the previously created dictionary.

421 intersection_SNPs_all_50.vcf

507 intersection_SNPs_all_80.vcf

1003 intersection_SNPs_pando_friends_50.vcf

1215 intersection_SNPs_pando_friends_80.vcf

1448 intersection_SNPs_pando_pon_50.vcf

1662 intersection_SNPs_pando_pon_80.vcf
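
A minimal sketch of step 6, not the actual contents of filter_vcf_from_dict.py. It assumes the intersection dictionary is keyed by chromosome and position, as in the illustrative key used for this sketch; the function names are hypothetical.

```python
def snp_key(vcf_line):
    """Build the assumed (chromosome, position) key for a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    return (fields[0], int(fields[1]))


def filter_vcf_from_dict(vcf_lines, snp_dict):
    """Keep only the data lines whose key is present in snp_dict."""
    return [
        line
        for line in vcf_lines
        if not line.startswith("#") and snp_key(line) in snp_dict
    ]
```

With unique SNP keys, the number of lines kept equals the number of dictionary entries found in the vcf file, which matches the counts listed above being identical to those of the .dict files in step 5.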

7 - compare the vcf file intersection obtained in 6) to another vcf file, and keep the difference between them.

LOCATION: /uufs/chpc.utah.edu/common/home/u6028866/data/Pando/variants/pando_only_variants/3-filter_vcf_for_germline_SNPs/3-filter_vcf_from_dict/keep_difference_between_vcf.py

SCRIPT NAME: keep_difference_between_vcf.py

COMMAND: open the script and change the file name in the function call.

OUTPUT: vcf file that contains the difference between two vcf files, using the previously created intersection vcf file.

9324 difference_SNPs_all_50.vcf

10689 difference_SNPs_all_80.vcf

8742 difference_SNPs_pando_friends_50.vcf

9981 difference_SNPs_pando_friends_80.vcf

8297 difference_SNPs_pando_pon_50.vcf

9534 difference_SNPs_pando_pon_80.vcf

7715 difference_SNPs_stringent_filter_50.vcf

8826 difference_SNPs_stringent_filter_80.vcf
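
A minimal sketch of step 7 together with a consistency check on the counts, not the actual contents of keep_difference_between_vcf.py. It assumes SNPs are matched on chromosome and position. The check also assumes that the "stringent filter" removes every SNP shared with either friends or PON (the union of the two intersections); that reading is an inference from the numbers, not something stated in the scripts.

```python
def snp_key(vcf_line):
    """Build the assumed (chromosome, position) key for a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    return (fields[0], int(fields[1]))


def keep_difference(original_lines, intersection_lines):
    """Keep the lines of the original vcf body absent from the intersection."""
    shared = {snp_key(line) for line in intersection_lines}
    return [line for line in original_lines if snp_key(line) not in shared]


# Line counts copied from the lists above, per het cutoff (50% / 80%).
PANDO = {"50": 9745, "80": 11196}
WITH_FRIENDS = {"50": 1003, "80": 1215}
WITH_PON = {"50": 1448, "80": 1662}
WITH_BOTH = {"50": 421, "80": 507}


def expected_stringent_count(cutoff):
    """Pando SNPs left after removing the union of both intersections."""
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
    removed = WITH_FRIENDS[cutoff] + WITH_PON[cutoff] - WITH_BOTH[cutoff]
    return PANDO[cutoff] - removed
```

Under these assumptions `expected_stringent_count("50")` gives 7715 and `expected_stringent_count("80")` gives 8826, which agree with the stringent-filter files. The other difference files are simple subtractions, e.g. 9745 - 1003 = 8742 for pando_friends_50.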

8 - Run "extract_proba_hets.py" (step 1) again on the vcf files from 7).