Post date: Jun 20, 2018 4:57:44 PM
I had very few SNPs pass my standard filtering particularly in terms of proportion of individuals with missing data. I suspected this was because some caterpillars generated little data, and this appears to be true (see below). So, from the variant subdirectory I ran vcfFilter.pl with the following options (note that this allows 50% missing data):
#### stringency variables, edits as desired
my $minCoverage = 384; # minimum number of sequences; DP
my $minAltRds = 2; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD
my $mq = 30; # minimum mapping quality; MQ
my $miss = 96; # maximum number of individuals with no data
##### this set is for whole genome shotgun data
I then grabbed the depth data (this was tricky with the vcf format):
grep ^Sc filtered2x_lyc_CHCs_variants.vcf | perl -p -i -e 's/^.+PL\s+//g' | perl -p -i -e 's/\S+:\d+,\d+:(\d+):\S+/$1/g' | perl -p -i -e 's/\.\/\.:\d+,\d+:(\d+)/$1/g' > depth.txt
perl -p -i -e 's/\S+[AGCT]\S+/0/g' depth.txt
perl -p -i -e 's/\t+/ /g' depth.txt
perl -p -i -e 's/\S+:\S+/0/g' depth.txt
Based on the data in depth.txt, 150 caterpillars had > 1X coverage, 124 had > 2X, and 30 had less than 0.5X. The mean across caterpillars was 23.5X (highly heterogeneous coverage). I want to go back and look at how this relates to weight, but for now I will likely use the 150 with > 1X coverage (entropy will hopefully help here). A file with the depth means by individual is meanDepthByInd.txt.
Then I filtered with filterSomeMore.pl as follows:
### stringency variables, edits as desired
my $maxCoverage = 11058; # maximum depth to avoid repeats, mean + 2sd
my $mind = 2; # don't call if SNPs are closer together than this
##### this set is for whole genome shotgun data
This gave me 42,350 SNPs in morefilter_filtered2x_lyc_CHCs_variants.vcf.