Variant filtering with Samtools CHC SNPs

Post date: May 22, 2019 1:09:45 AM

Filtering was done in /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Traits/CHCs/Variants/. As noted in a previous post, there are several individuals with very few reads (NextSeq/small caterpillars?). Thus, I am being less stringent than normal for % missing data. Here is what I ran (in vcfFilter.pl):

#### stringency variables, edit as desired

my $minCoverage = 384; # minimum number of sequences; DP

my $minAltRds = 2; # minimum number of sequences with the alternative allele; AC

my $notFixed = 1.0; # removes loci fixed for alt; AF

my $mq = 30; # minimum mapping quality; MQ

my $miss = 96; # maximum number of individuals with no data

##### this set is for whole genome shotgun data

perl vcfFilter.pl lyc_chcs_samtbcft.vcf

Finished filtering lyc_chcs_samtbcft.vcf

Retained 80218 variable loci
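For reference, the settings above amount to a per-locus test on the INFO column and genotype columns of each VCF record. The following is a minimal Python sketch of that logic, not vcfFilter.pl itself. The function names, the choice of `>=` versus `>` for each cutoff, and the rule that a sample counts as missing when its genotype starts with `./.` are all assumptions made for illustration; the real script may define these differently.

```python
# Sketch only: mirrors the thresholds listed in the post, not vcfFilter.pl itself.
MIN_COVERAGE = 384   # DP: minimum number of sequences
MIN_ALT_READS = 2    # AC: minimum count for the alternative allele
NOT_FIXED = 1.0      # AF: loci at or above this are treated as fixed for alt
MIN_MQ = 30          # MQ: minimum mapping quality
MAX_MISSING = 96     # maximum number of individuals with no data


def parse_info(info):
    """Turn 'DP=500;AC=4;...' into a dict of strings; bare flags map to 'True'."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value if value else "True"
    return out


def keep_locus(line):
    """Return True if a tab-delimited VCF data line passes every threshold."""
    fields = line.rstrip("\n").split("\t")
    info = parse_info(fields[7])
    genotypes = fields[9:]
    # Assumption: a sample has no data when its genotype call starts with './.'
    n_missing = sum(1 for g in genotypes if g.startswith("./."))
    try:
        return (
            float(info["DP"]) >= MIN_COVERAGE
            and float(info["AC"].split(",")[0]) >= MIN_ALT_READS
            and float(info["AF"].split(",")[0]) < NOT_FIXED
            and float(info["MQ"]) >= MIN_MQ
            and n_missing <= MAX_MISSING
        )
    except (KeyError, ValueError):
        # A required INFO tag is absent or non-numeric: drop the locus.
        return False
```

With only the first value of multi-valued tags such as AC and AF being checked, this sketch effectively assumes biallelic sites.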

I then wanted to see how many of the SNPs from the time series data were in this data set. The answer wasn't awesome. For the unfiltered CHC data it is:

perl countSnpMatch.pl lyc_chcs_samtbcft.vcf

Found 7356 out of 12886 SNPs

After this round of filtering we only have:

perl countSnpMatch.pl filtered2x_lyc_chcs_samtbcft.vcf

Found 2650 out of 12886 SNPs
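The comparison behind these counts can be thought of as a set intersection on SNP coordinates. Here is a minimal Python sketch of that idea, assuming SNPs are matched on scaffold name and position alone; countSnpMatch.pl may match on something else, and the names `snp_keys` and `count_matches` are hypothetical.

```python
def snp_keys(vcf_lines):
    """Collect (scaffold, position) pairs from VCF data lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        keys.add((chrom, int(pos)))
    return keys


def count_matches(reference_keys, vcf_lines):
    """Return (number of reference SNPs found in the VCF, total reference SNPs)."""
    found = reference_keys & snp_keys(vcf_lines)
    return len(found), len(reference_keys)
```

Given a set of time series SNP coordinates, calling `count_matches` on the open VCF file handle would return the two numbers reported in a "Found X out of Y SNPs" line.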

I next ran filterSomeMore.pl on the filtered2x*vcf file. This removed SNPs within 2 bp of each other as well as those with excessively high coverage (> the mean + 3 SD). This left me with 64,195 SNPs, of which 2,282 matched SNPs in the time series data set.
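The two criteria in this second pass can be sketched in Python as follows. This is an illustration rather than filterSomeMore.pl: it assumes the depth cutoff is applied first, that the mean and sample SD are computed over all input loci, that "within 2 bp" means a position difference of 2 or less on the same scaffold, and that both members of a close pair are dropped. The function name and its arguments are hypothetical.

```python
from statistics import mean, stdev


def filter_some_more(loci, min_gap=2, n_sd=3):
    """Filter loci given as (scaffold, position, depth) tuples.

    The list must be sorted by scaffold, then position. Loci with
    depth > mean + n_sd * SD are dropped first; then every locus lying
    within min_gap bp of a neighbour on the same scaffold is dropped.
    """
    depths = [depth for _, _, depth in loci]
    max_depth = mean(depths) + n_sd * stdev(depths)
    kept = [locus for locus in loci if locus[2] <= max_depth]

    too_close = set()
    for a, b in zip(kept, kept[1:]):
        # Only adjacent loci need comparing because the list is sorted.
        if a[0] == b[0] and b[1] - a[1] <= min_gap:
            too_close.update((a, b))
    return [locus for locus in kept if locus not in too_close]
```

Applying the depth cutoff before the proximity check means a high-coverage locus cannot cause its neighbours to be removed; reversing the order would be slightly more conservative.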
