Post date: Jun 07, 2018 4:41:31 PM
I used my own perl scripts to filter the variant set from GATK. This was all done in the Variants_2018 sub-directory.
1. First, I ran my standard vcfFilter.pl filter script:
perl ../Scripts/vcfFilter.pl lycTimeSeriesVariants.vcf
This made filtered2x_lycTimeSeriesVariants.vcf, which contains 36,225 SNPs. Here is the head of the script with relevant parameters:
#### stringency variables, edit as desired
my $minCoverage = 3072; # minimum number of sequences; DP
my $minAltRds = 20; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidence to non-reference read depth; QD
my $mq = 30; # minimum mapping quality; MQ
my $miss = 307; # maximum number of individuals with no data
##### this set is for whole genome shotgun data
The file filter_log has information on what was filtered and why (missing data is a big one, perhaps because two runs were combined).
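The filtering logic behind these thresholds can be sketched as follows. This is a Python illustration, not the actual vcfFilter.pl script; the INFO field names are the standard GATK annotations listed in the comments above, and the `parse_info`/`keep_site` helpers are hypothetical.

```python
# Sketch of the site-level filtering described above. Thresholds come from
# the vcfFilter.pl header; the GATK INFO field names are standard
# annotations, assumed rather than confirmed against the script itself.

def parse_info(info_str):
    """Parse a VCF INFO column like 'DP=3500;QD=12.3' into a dict of floats."""
    out = {}
    for item in info_str.split(";"):
        if "=" in item:
            key, val = item.split("=", 1)
            try:
                out[key] = float(val.split(",")[0])
            except ValueError:
                pass  # skip non-numeric values
    return out

def keep_site(info, n_missing=0):
    """Apply the thresholds from the vcfFilter.pl header to one site."""
    checks = [
        info.get("DP", 0) >= 3072,                  # minimum total depth
        info.get("AC", 0) >= 20,                    # minimum alt-allele count
        info.get("AF", 0.0) < 1.0,                  # not fixed for the alt allele
        abs(info.get("BaseQRankSum", 0.0)) <= 3,    # base quality rank sum
        abs(info.get("MQRankSum", 0.0)) <= 2.5,     # mapping quality rank sum
        abs(info.get("ReadPosRankSum", 0.0)) <= 2,  # read position rank sum
        info.get("QD", 0.0) >= 2,                   # variant confidence / depth
        info.get("MQ", 0.0) >= 30,                  # minimum mapping quality
        n_missing <= 307,                           # max individuals with no data
    ]
    return all(checks)

info = parse_info("DP=3500;AC=45;AF=0.12;QD=14.2;MQ=42.0;MQRankSum=0.3")
print(keep_site(info))  # a site passing every threshold
```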
2. I then got depth information for these SNPs and ran filterSomeMore.pl to remove SNPs within 3 bp of one another and SNPs with coverage greater than the mean + 2 SD.
cat filtered2x_lycTimeSeriesVariants.vcf | perl -p -i -e 's/^.*PL\s+//' | perl -p -i -e 's/.\/.:\d+,\d+:(\d+)\S+/\1/g' | perl -p -i -e 's/\S+:(\d+\s+)/$1/g' > depth.txt
grep -v ^# depth.txt > temp
mv temp depth.txt
## mean cov. is ~11x
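The ~11x figure can be checked with a quick awk pass over depth.txt; the layout assumed here (one row per SNP, one whitespace-separated depth per individual) follows from the extraction pipeline above. The toy matrix and `/tmp` path are for illustration only.

```shell
# Minimal sketch: build a toy depth matrix, then compute the overall mean
# depth across all SNPs and individuals (run on the real depth.txt instead).
printf '10 12 11\n9 13 11\n' > /tmp/depth_example.txt
awk '{ for (i = 1; i <= NF; i++) { sum += $i; n++ } } END { print sum / n }' /tmp/depth_example.txt
```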
grep ^Sc filtered2x_lycTimeSeriesVariants.vcf | cut -f 1,2 | perl -p -i -e 's/Scaffold_(\d+);HRSCAF_\d+\s+/\1 /' > snps.txt
perl ../Scripts/filterSomeMore.pl filtered2x_lycTimeSeriesVariants.vcf > filter2_log
This generated the final variant set, morefilter_filtered2x_lycTimeSeriesVariants.vcf, which contains 30,513 SNPs.
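The two filters applied in this step can be sketched as below. The real filterSomeMore.pl is Perl and not shown; this Python version mirrors the prose only, and the function name and input format (parallel lists of sorted (scaffold, position) tuples and per-SNP mean coverages) are assumptions.

```python
import statistics

def filter_some_more(snps, depths, min_gap=3, sd_mult=2):
    """Sketch of the filterSomeMore.pl logic described above: drop SNPs
    closer than min_gap bp to a neighbor on the same scaffold, and SNPs
    whose coverage exceeds the overall mean + sd_mult standard deviations.

    snps:   list of (scaffold, position) tuples, sorted within scaffold
    depths: list of per-SNP mean coverages, parallel to snps
    """
    max_cov = statistics.mean(depths) + sd_mult * statistics.stdev(depths)
    keep = []
    for i, ((scaf, pos), cov) in enumerate(zip(snps, depths)):
        # too close to the previous or next SNP on the same scaffold?
        too_close = (
            (i > 0 and snps[i - 1][0] == scaf and pos - snps[i - 1][1] < min_gap)
            or (i + 1 < len(snps) and snps[i + 1][0] == scaf
                and snps[i + 1][1] - pos < min_gap)
        )
        if not too_close and cov <= max_cov:
            keep.append((scaf, pos))
    return keep

# toy example: two SNPs 1 bp apart and one high-coverage SNP get dropped
snps = [("Scaffold_1", p) for p in (100, 101, 200, 210, 220, 230, 240, 250, 260, 300)]
depths = [10] * 9 + [100]
print(len(filter_some_more(snps, depths)))  # prints 7
```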