Post date: May 22, 2019 1:9:45 AM
Filtering was done in /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Traits/CHCs/Variants/. As noted in a previous post, several individuals have very few reads (NextSeq/small caterpillars?), so I am being less stringent than normal about the percentage of missing data. Here is what I ran (in vcfFilter.pl):
#### stringency variables, edits as desired
my $minCoverage = 384; # minimum number of sequences; DP
my $minAltRds = 2; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 96; # maximum number of individuals with no data
##### this set is for whole genome shotgun data
perl vcfFilter.pl lyc_chcs_samtbcft.vcf
Finished filtering lyc_chcs_samtbcft.vcf
Retained 80218 variable loci
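For reference, here is a minimal sketch of how thresholds like these could be applied to a single VCF line. This is not the code in vcfFilter.pl. The sub name passes_filters, the %thr hash, and the rule that a sample counts as missing when its genotype field starts with "./." are all my assumptions for illustration; the real script may define missing data differently.

```perl
use strict;
use warnings;

# Threshold values copied from the post; the hash itself is hypothetical.
my %thr = (minCoverage => 384, minAltRds => 2, notFixed => 1.0, mq => 30, miss => 96);

# Hypothetical helper: returns 1 if a tab-delimited VCF data line passes
# every threshold, 0 otherwise.
sub passes_filters {
    my ($line, $t) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    return 0 if @f < 10;    # need the 9 fixed columns plus at least one sample

    # Parse the INFO column (column 8) into key/value pairs; bare flags are skipped.
    my %info;
    for my $kv (split /;/, $f[7]) {
        my ($k, $v) = split /=/, $kv, 2;
        $info{$k} = $v if defined $v;
    }
    for my $key (qw(DP AC AF MQ)) {
        return 0 unless defined $info{$key};
    }

    # For multi-allelic sites, use the first alternative allele only (assumption).
    my ($ac) = split /,/, $info{AC};
    my ($af) = split /,/, $info{AF};

    return 0 if $info{DP} < $t->{minCoverage};    # too few reads overall
    return 0 if $ac < $t->{minAltRds};            # too little support for alt
    return 0 if $af >= $t->{notFixed};            # fixed for the alt allele
    return 0 if $info{MQ} < $t->{mq};             # poor mapping quality

    # Assumption: a sample is "missing" when its genotype starts with "./." or ".|.".
    my $missing = grep { /^\.[\/|]\./ } @f[9 .. $#f];
    return 0 if $missing > $t->{miss};

    return 1;
}
```

Calling passes_filters($line, \%thr) on each non-header line and printing the lines that return 1 would give a filtered file under these assumptions.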
I then wanted to see how many of the SNPs from the time series data were in this data set. The answer wasn't awesome. For the unfiltered CHC data it is:
perl countSnpMatch.pl lyc_chcs_samtbcft.vcf
Found 7356 out of 12886 SNPs
After this round of filtering we only have:
perl countSnpMatch.pl filtered2x_lyc_chcs_samtbcft.vcf
Found 2650 out of 12886 SNPs
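The counts above come from countSnpMatch.pl, which is not shown here. A minimal sketch of that kind of overlap count, assuming SNPs are matched on scaffold plus position, could look like the following. The sub name count_snp_matches and the "scaffold:position" key format are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many target SNPs also occur in a VCF.
#   $targets   - array ref of "scaffold:position" keys (e.g. the time series SNPs)
#   $vcf_lines - array ref of VCF lines, header lines included
# Returns (number of targets found, total number of distinct targets).
sub count_snp_matches {
    my ($targets, $vcf_lines) = @_;
    my %want = map { $_ => 1 } @$targets;
    my %seen;
    for my $line (@$vcf_lines) {
        next if $line =~ /^#/;                    # skip header lines
        my ($chrom, $pos) = split /\t/, $line;    # first two VCF columns
        next unless defined $pos;
        my $key = "$chrom:$pos";
        $seen{$key} = 1 if $want{$key};           # count each target once
    }
    return (scalar(keys %seen), scalar(keys %want));
}
```

With the lines of a VCF read into @vcf and the target keys in @targets, my ($found, $total) = count_snp_matches(\@targets, \@vcf) gives the two numbers needed for a "Found X out of Y SNPs" message.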
I next ran filterSomeMore.pl on the filtered2x*vcf file. This removed SNPs within 2 bp of each other or those with excessively high coverage (> the mean + 3 SD). This left me with 64,195 SNPs, of which 2,282 matched SNPs in the time series data set.
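To make the two criteria concrete, here is a rough sketch of a proximity plus coverage filter. It is not the code in filterSomeMore.pl. I am assuming that both members of a too-close pair are dropped, that "within 2 bp" means a distance of 2 or less on the same scaffold, that the SD is the sample SD, and that both criteria are evaluated against the full input set. The sub name filter_some_more and the hash keys scaf, pos, and dp are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper. Each SNP is a hash ref: { scaf => ..., pos => ..., dp => ... }.
#   $min_dist - SNPs this close or closer on one scaffold are both dropped
#   $n_sd     - SNPs with depth > mean + $n_sd * SD are dropped
# Returns an array ref of the retained SNPs, sorted by scaffold and position.
sub filter_some_more {
    my ($snps, $min_dist, $n_sd) = @_;
    my $n = @$snps;
    return [] if $n == 0;

    # Mean and sample SD of depth across all input SNPs.
    my $sum = 0;
    $sum += $_->{dp} for @$snps;
    my $mean = $sum / $n;
    my $ss = 0;
    $ss += ($_->{dp} - $mean) ** 2 for @$snps;
    my $sd = $n > 1 ? sqrt($ss / ($n - 1)) : 0;
    my $max_dp = $mean + $n_sd * $sd;

    # Sort so that neighbours on the same scaffold are adjacent.
    my @sorted = sort { $a->{scaf} cmp $b->{scaf} or $a->{pos} <=> $b->{pos} } @$snps;

    # Flag both members of every adjacent pair that sits too close together.
    my %too_close;
    for my $i (1 .. $#sorted) {
        my ($prev, $cur) = @sorted[$i - 1, $i];
        if ($prev->{scaf} eq $cur->{scaf} and $cur->{pos} - $prev->{pos} <= $min_dist) {
            $too_close{"$_->{scaf}:$_->{pos}"} = 1 for ($prev, $cur);
        }
    }

    # Keep SNPs that pass both criteria.
    my @kept;
    for my $snp (@sorted) {
        next if $snp->{dp} > $max_dp;
        next if $too_close{"$snp->{scaf}:$snp->{pos}"};
        push @kept, $snp;
    }
    return \@kept;
}
```

Under these assumptions, filter_some_more(\@snps, 2, 3) corresponds to the settings described above. With very few SNPs no depth can exceed the mean + 3 SD, so the coverage filter only matters on realistically sized data sets.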