2019-12-02 Filtering vcf files

Post date: Dec 02, 2019 10:17:11 PM

We called variants with bcftools on three groups: Pando only (read depth >4), Pando friends (no read depth filter), and a panel of normals (100 other aspen trees) with a total number of reads between 150 000 and 500 000.

Step 1

I apply stringent filters (extend explanation here) using Zach's Perl scripts (callFilters.sh, which calls vcfFilter.pl).

Step 2

I remove high-coverage variants, defined as those with read depth greater than the mean + 2 SD (use getDepth_pando_friends_variants.R).
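
For reference, the mean + 2 SD cutoff can be sketched in shell with awk. This is not the R script named above, only a rough equivalent under assumptions: the depth is read from a DP= entry in the INFO column, the SD is the sample SD (n - 1 denominator), and the function names depth_cutoff and filter_by_depth are made up for this example.

```shell
# Sketch only: assumes per-site depth is stored as DP=<number> in INFO (column 8).

# Print mean + 2*SD of the per-site depth for the VCF given as $1.
depth_cutoff() {
    awk -F'\t' '
        !/^#/ {
            n_info = split($8, info, ";")
            for (i = 1; i <= n_info; i++) {
                if (info[i] ~ /^DP=/) {
                    d = substr(info[i], 4) + 0
                    n++; sum += d; sumsq += d * d
                    break
                }
            }
        }
        END {
            if (n < 2) exit 1                  # SD is undefined for fewer than 2 sites
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            if (var < 0) var = 0               # guard against rounding error
            printf "%.2f\n", mean + 2 * sqrt(var)
        }' "$1"
}

# Print header lines plus records whose depth is at or below the cutoff.
# $1 = VCF file, $2 = cutoff. Records without a DP= entry are dropped.
filter_by_depth() {
    awk -F'\t' -v max="$2" '
        /^#/ { print; next }
        {
            n_info = split($8, info, ";")
            for (i = 1; i <= n_info; i++) {
                if (info[i] ~ /^DP=/) {
                    if (substr(info[i], 4) + 0 <= max) print
                    break
                }
            }
        }' "$1"
}
```

Possible usage, with a hypothetical output file name: cutoff=$(depth_cutoff filtered_pon_variants.vcf), then filter_by_depth filtered_pon_variants.vcf "$cutoff" > out.vcf.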

1 - Pando only

109 ind. Pando_only_variants.vcf has 236104 variants --> filtered_Pando_only_variants.vcf with 16057 SNPs retained --> filtered2xHicov_Pando_only_variants.vcf with 15543 SNPs retained.

2 - Pando friends

154 ind. pando_friends_variants.vcf has 399610 variants --> filtered_pando_friends_variants.vcf retained 35016 variable loci --> filtered2xHiCov_pando_friends_variants.vcf with 33499 retained SNPs.

3 - Panel of Normals

100 ind. pon_variants.vcf has 514012 variants --> filtered_pon_variants retained 86633 variable loci --> filtered2xHiCov_pon_variants.vcf with 84395 SNPs retained.
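
Variant counts like the ones above can be checked by counting the non-header lines of each vcf file. A minimal sketch, assuming uncompressed vcf files with one variant per line; the function names count_variants and pct_retained are made up for this example.

```shell
# Count variant records, i.e. lines that do not start with '#'.
count_variants() {
    grep -vc '^#' "$1"
}

# Percentage of variants retained by a filtering step.
# $1 = count before filtering, $2 = count after filtering.
pct_retained() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", 100 * after / before }'
}
```

For example, pct_retained 236104 16057 prints 6.8, so the first filtering step keeps about 6.8% of the Pando only variants.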

Plots of read depth coverage can be found here.

Command to count the number of individuals per vcf file:

grep ^#CH pon_variants.vcf | awk '{print NF}'

--> gives you the total number of columns (the 9 fixed VCF columns plus one per individual)

grep ^#CH pando_friends_variants.vcf | cut -f 10-163 | awk '{print NF}'

--> gives you the number of columns from 10 to 163, i.e. the individuals only (154)
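
The two commands above can be folded into one that does not need the last column number hard-coded. A small sketch, assuming a standard tab-separated #CHROM header line with 9 fixed columns (CHROM to FORMAT) before the samples; the function name count_samples is made up for this example.

```shell
# Number of individuals = fields on the #CHROM line minus the 9 fixed VCF columns.
count_samples() {
    grep -m1 '^#CHROM' "$1" | awk -F'\t' '{ print NF - 9 }'
}
```

Usage: count_samples pando_friends_variants.vcf. If bcftools is available, bcftools query -l pando_friends_variants.vcf | wc -l should give the same number.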
