Post date: May 29, 2015 2:5:38 PM
Zach calculated the mean number of reads per SNP for each individual. Overall, coverage is great with an average of 5.9 reads per individual per locus. But, this varies for wild vs. captive (see that attached boxplots, which show the distribution of mean coverage across the individuals in each population/captive generation). Coverage is clearly lower in the wild populations, and quite a bit so. What to do about this...
The plan from here:
-Need to fix discrepancy in coverage for PCA and kinship
-theta and pi, estimated from samtools, are ok: These estimates are probably not as good as they would be with higher sample sizes, but they shouldn't be biased. They properly account for uncertainty in genotypes and allele frequencies, and even in whether
each nucleotide position is variable. They then report the most likely parameter value rather than the mean or median of the posterior, which should be less likely to be affected by sample size. But, 3 and 5 are still low sample sizes (3 in particular) so these estimates are not as robust as those for other populations (i.e. adding a few more individuals from the same population could shift them up or down).
-For Gst and expected heterozygosity and proportion of rare alleles, estimated from sample allele frequencies, drop SG and UP and run a population allele frequency model to incorporate uncertainty in allele frequencies.
Getting higher coverage loci:
perl getIndXLocusCoverage.pl filtered_varEsosorum.vcf
#In R:
> cv<-read.table("indXlocCov.txt",header=TRUE)
> cap<-2:187
> wild<-188:259
> capCv<-apply(cv[,cap],1,mean)
> wildCv<-apply(cv[,wild],1,mean)
> sum(capCv > 5 & wildCv > 5)
1543 loci with 5x coverage (we are more 95% confident we will get heterozygotes instead of calling them homozygotes; 0.5^5)
> which(capCv > 5 & wildCv > 5)
> tokeep<-which(capCv > 5 & wildCv > 5)
> write.table(tokeep,"hicovindexes.txt",row.names=F,col.names=F,quote=F)
#Then added this to PCA and kinship R code:
hic<-scan("hicovindexes.txt")
g<-g[hic,]