Post date: Dec 02, 2019 10:17:11 PM
We called variants with bcftools on three groups: Pando only (read depth > 4), Pando friends (no read depth filter), and a panel of normals (100 other aspen trees, with total number of reads between 500 000 and 150 000).
Step 1
I now apply stringent filters (extend explanation here) using Zach's Perl scripts (callFilters.sh, which calls vcfFilter.pl).
Step 2
I remove high-coverage variants, defined as those with read depth above mean + 2 SD (using getDepth_pando_friends_variants.R).
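The mean + 2 SD cutoff can also be sketched in the shell. This is not getDepth_pando_friends_variants.R, only a minimal illustration. It assumes a hypothetical tab-separated input with a locus ID in column 1 and total read depth in column 2, and it uses the sample standard deviation (n - 1 in the denominator). The function name filter_high_cov is made up for this example.

```shell
# Hypothetical helper: keep loci whose read depth is at most mean + 2 SD.
# Assumed input: tab-separated, column 1 = locus ID, column 2 = read depth.
filter_high_cov() {
    awk -F'\t' '
        # First pass over the input: store every row and accumulate sums.
        { id[NR] = $1; d[NR] = $2; sum += $2; sumsq += $2 * $2 }
        END {
            if (NR < 2) exit 1                    # SD is undefined for < 2 loci
            mean = sum / NR
            var = (sumsq - NR * mean * mean) / (NR - 1)
            if (var < 0) var = 0                  # guard against rounding error
            cutoff = mean + 2 * sqrt(var)
            # Print only the loci at or below the cutoff, in input order.
            for (i = 1; i <= NR; i++)
                if (d[i] <= cutoff) print id[i] "\t" d[i]
        }
    ' "$1"
}
```

Calling filter_high_cov on such a depth table prints the retained loci, which could then be used to subset the filtered vcf.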
1 - Pando only
109 ind. Pando_only_variants.vcf has 236104 variants --> filtered_Pando_only_variants.vcf with 16057 SNPs retained --> filtered2xHicov_Pando_only_variants.vcf with 15543 SNPs retained.
2 - Pando friends
154 ind. pando_friends_variants.vcf has 399610 variants --> filtered_pando_friends_variants.vcf retained 35016 variable loci --> filtered2xHiCov_pando_friends_variants.vcf with 33499 retained SNPs.
3 - Panel of Normals
100 ind. pon_variants.vcf has 514012 variants --> filtered_pon_variants retained 86633 variable loci --> filtered2xHiCov_pon_variants.vcf with 84395 SNPs retained.
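The counts above are numbers of records per vcf file. This entry does not say how they were obtained; one common way, shown here as an assumption, is to count the non-header lines of an uncompressed vcf, since every header line starts with #. The function name count_variants is made up for this example.

```shell
# Hypothetical helper: count variant records in an uncompressed VCF,
# i.e. every line that does not start with '#'.
count_variants() {
    grep -vc '^#' "$1"
}

# Possible use with file names from this entry (assumes they sit in the
# working directory):
# for f in Pando_only_variants.vcf filtered_Pando_only_variants.vcf; do
#     printf '%s\t%s\n' "$f" "$(count_variants "$f")"
# done
```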
Plots of read depth coverage can be found here.
Command to count the number of individuals per vcf file:
grep ^#CH pon_variants.vcf | awk '{print NF}'
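--> gives you the total number of columns (the 9 fixed VCF columns plus one per individual, so subtract 9 to get the number of individuals)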
grep ^#CH pando_friends_variants.vcf | cut -f 10-163 | awk '{print NF}'
--> gives you the number of sample columns, from column 10 to the end (cut -f 10- does the same without hard-coding the last column)
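The two commands above can be folded into one that works for any vcf without knowing the last column number in advance. A minimal sketch, assuming an uncompressed, tab-separated vcf with the standard 9 fixed columns (CHROM through FORMAT) before the samples; the function name count_samples is made up for this example.

```shell
# Hypothetical helper: number of individuals in a VCF, taken as the number of
# columns on the #CHROM header line minus the 9 fixed columns.
count_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Possible use (assumes the file is in the working directory):
# count_samples pando_friends_variants.vcf
```

If bcftools is at hand, bcftools query -l pando_friends_variants.vcf | wc -l should give the same number, since query -l lists one sample name per line.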