filtering and combining vcf files

Post date: Nov 30, 2015 4:48:59 AM

We have two sets of vcf files each for diploids and triploids. An initial set (DIP, TRIP) include variants called initially in diploids and triploids, whereas the second set contains variant data for those variants called in the other ploidy. I wrote a script to combine the two files for each ploidy, while only retaining those at 2x coverage in each ploidy group. The script /labs/evolution/data/aspen/gbs/Assemblies/Scripts/combineVcfs.pl, takes no arguments. 301960 variants were retained and are in the files combinedDip.vcf and combinedTrip.vcf in /labs/evolution/data/aspen/gbs/Variants/.

Next, I ran vcfFilter.pl on these files to further filter based on the following criteria (note comments for diploids vs. triploids):

#### stringency variables, edits as desired

#my $minCoverage = 226; # minimum number of sequences; DP : use for 113 dips

my $minCoverage = 108; # minimum number of sequences; DP : use for 54 trips

#my $minAltRds = 5; # minimum number of sequences with the alternative allele; AC : use for 113 dips

my $minAltRds = 3; # minimum number of sequences with the alternative allele; AC :use for trips

my $notFixed = 1.5; # removes loci fixed for alt; AF : set to 1.5 not possble

my $dels = 0.1; # maximum prop. of reads spanning an indel; DEL

my $mq0 = 5; # maximum number of mapping quality 0 reads; MQ0

my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum

my $mqrs = 2; # maximum absolute value of the mapping quality rank sum test; MQRankSum

#my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum

my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD

my $mq = 30; # minimum mapping quality; MQ

#my $miss = 22; # maximum number of individuals with no data : use for dips

my $miss = 10; # maximum number of individuals with no data : use for trips

We retained 98,053 SNVs for triploid and 100,869 for diploids (compare this to an initial run based on the original variants called for each ploidy: diploids = 53,315, triploids = 99,362). These are in filtered2x_combined[Dips|Trips].vcf in the variants directory.

Finally, I subset these variants to only retain variant data for the intersection of both filtered sets (this essentially removes variants kicked out in either ploidy set). This was accomplished by subsetVcfs.pl in /labs/evolution/data/aspen/gbs/Assemblies/Scripts, which also has no arguments. The final vcf files sub_filtered2x_combined[Dips|Trips].vcf are in the variants directory and contain 63,345 variants (note that the header information has now been removed). Next I need to extract and potentially threshold the genotype likelihoods.

Page updated

Google Sites

Report abuse