CHC variant calling

Post date: Jun 20, 2018 4:57:44 PM

I had very few SNPs pass my standard filtering particularly in terms of proportion of individuals with missing data. I suspected this was because some caterpillars generated little data, and this appears to be true (see below). So, from the variant subdirectory I ran vcfFilter.pl with the following options (note that this allows 50% missing data):

#### stringency variables, edits as desired

my $minCoverage = 384; # minimum number of sequences; DP

my $minAltRds = 2; # minimum number of sequences with the alternative allele; AC

my $notFixed = 1.0; # removes loci fixed for alt; AF

my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum

my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum

my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum

my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD

my $mq = 30; # minimum mapping quality; MQ

my $miss = 96; # maximum number of individuals with no data

##### this set is for whole genome shotgun data

I then grabbed the depth data (this was tricky with the vcf format):

grep ^Sc filtered2x_lyc_CHCs_variants.vcf | perl -p -i -e 's/^.+PL\s+//g' | perl -p -i -e 's/\S+:\d+,\d+:(\d+):\S+/$1/g' | perl -p -i -e 's/\.\/\.:\d+,\d+:(\d+)/$1/g' > depth.txt

perl -p -i -e 's/\S+[AGCT]\S+/0/g' depth.txt

perl -p -i -e 's/\t+/ /g' depth.txt

perl -p -i -e 's/\S+:\S+/0/g' depth.txt

Based on the data in depth.txt, 150 caterpillars had > 1X coverage, 124 had > 2X, and 30 had less than 0.5X. The mean across caterpillars was 23.5X (highly heterogeneous coverage). I want to go back and look at how this relates to weight, but for now I will likely use the 150 with > 1X coverage (entropy will hopefully help here). A file with the depth means by individual is meanDepthByInd.txt.

Then I filtered with filterSomeMore.pl as follows:

### stringency variables, edits as desired

my $maxCoverage = 11058; # maximum depth to avoid repeats, mean + 2sd

my $mind = 2; # don't call if SNPs are closer together than this

##### this set is for whole genome shotgun data

This gave me 42,350 SNPs in morefilter_filtered2x_lyc_CHCs_variants.vcf.

Page updated

Google Sites

Report abuse