Variance Ne with GATK vs. samtools/bcftools

Post date: Apr 17, 2019 11:0:33 PM

I estimated variance effective population sizes with the Bayesian bootstrap method I have used previously (my own program, varne). All of the relevant files are in /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Ne. I used the MLE allele frequencies for this.

First, I generated infiles for varne with mkNeInfiles.pl. Then, I ran runNeEst.pl (a wrapper script) which generates all of the Ne* files (posterior samples) and the NeLog_13-17.txt log file with estimates. This is based on change between 2013 and 2017 (I also ran 2013-2015 and 2015-2017, but the values are smaller... I think the time period is too short). I did all of this for all three SNP sets/genotype likelihoods.

Here is runNeEst.pl:

#!/usr/bin/perl

# then subset of files from a given set

foreach $in13 (@ARGV){

if ($in13 =~ m/full/){

$L = 30033;

}

else{

$L = 12886;

}

$in13 =~ m/in_([A-Z]+)\-\d+_([A-Z]+_[a-z]+)/;

$pop = $1;

$set = $2;

$in17 = $in13;

$in17 =~ s/13/17/ or die "failed sub $in13\n";

system "varne -a $in13 -b $in17 -l $L -t 4 -n 3000 -x 1000 > Ne_"."$set"."_"."$pop"."_13-17.txt\n";

}

Note that this runs 1000 bootstrap replicates and assumes a census size of about 3000. Then I ran parseNeLog.pl to extract the posterior medians and 95% ETPIs for each population and data set. These are in SummaryNe.txt. Different data sets give highly correlated but different estimates of Ne. In particular, the estimates from samtools genotype likelihoods are higher, which effectively means estimates of allele frequencies across time for these are more similar. This makes me generally believe them more (chance/error shouldn't make thigs more similar).

I made a simple plot of Ne with ETPIs with neSums.R. The plot is varNe.pdf.

Page updated

Google Sites

Report abuse