Post date: Jan 31, 2020 12:44:7 PM
I need a short summary of what I have done so far, to know where I stand.
- Align the genome, call the variants and filter: this gives the full VCF file, with ancestral and de novo mutations mixed together.
- Read the VCF file and convert the Phred score of the heterozygous genotype estimate back into a probability (in [0, 1]).
- Test different thresholds t (between 0.01 and 0.1) and compare the numbers of homs, hets and NAs obtained, so that we keep the most information (not too many NAs) without having too many hets. We keep t = 0.06, meaning:
- an individual is called het if its probability of being het is greater than 1 - t = 0.94 (this says nothing about the number of hets at this SNP; it is only how certain we are that this individual really is het!)
- Next, code every confidently called individual as 1 if het and 0 if hom.
- Count the proportion of 0s and 1s per SNP.
- We obtain a distribution of the number of hets per SNP with two peaks: many SNPs with very few hets, and many SNPs with a lot of hets. We think that the SNPs with a lot of hets are ancestral mutations, shared by everyone in the stand. We think that the SNPs shared by fewer than 50% of the population are de novo mutations.
- We filter the file to keep only these de novo mutations.
- In the two groups we created as comparisons (PON and friends), we check which mutations are shared between them.
- We get rid of these shared mutations (sequencing errors, or hypermutable states that would perturb our signal).
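
To keep track of the logic, here is a minimal Python sketch of the calling and classification steps listed above. It is not the actual pipeline code: the function names are made up for illustration, and it rests on two assumptions of mine. First, that the Phred scores are PL-style triples (hom-ref, het, hom-alt) that get normalised into a probability. Second, that an individual is called hom when its het probability is below t (the notes only state the rule for hets).

```python
# Sketch only: names, PL layout and the hom rule are assumptions,
# not the real pipeline.

def het_probability(pl):
    """Turn Phred-scaled likelihoods (hom-ref, het, hom-alt) into P(het)."""
    likelihoods = [10 ** (-q / 10) for q in pl]  # Phred -> relative likelihood
    return likelihoods[1] / sum(likelihoods)     # normalise to [0, 1]


def call_genotype(p_het, t=0.06):
    """Return 1 (sure het), 0 (sure hom) or None (NA, not sure enough)."""
    if p_het > 1 - t:
        return 1
    if p_het < t:  # assumption: symmetric rule for homs
        return 0
    return None


def het_fraction(calls):
    """Proportion of 1s among the individuals that have a call at one SNP."""
    called = [c for c in calls if c is not None]
    if not called:
        return None
    return sum(called) / len(called)


def is_de_novo(frac, max_frac=0.5):
    """Candidate de novo SNP: hets in fewer than 50% of called individuals."""
    return frac is not None and frac < max_frac


# Toy SNP with four individuals (made-up PL values).
snp_pl = [[60, 0, 60], [0, 30, 60], [0, 40, 80], [0, 3, 30]]
calls = [call_genotype(het_probability(pl)) for pl in snp_pl]
print(calls)                            # [1, 0, 0, None]
print(is_de_novo(het_fraction(calls)))  # True
```

In this toy example the last individual stays NA because its het probability (about 0.33) is neither above 0.94 nor below 0.06, so it does not count in the per-SNP proportion.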
1- Which SNPs to get rid of?
Strategy one: be conservative and get rid of the SNPs common to Pando and PON AND of those common to Pando friends.
a - run /data/scripts/filter_vcf_from_dict_keep.py
Reads in the VCF file and the dictionary that contains the SNPs we want to keep, and only keeps the lines with a match.
b - run /data/scripts/filter_vcf_from_dict_delete.py