Post date: Dec 29, 2013 9:35:58 PM
I want to estimate the folded site frequency spectrum for each Lycaeides population. This is a nice summary of genetic diversity, was requested by reviewers of the admixture paper, can be compared to the standard neutral model (e.g. Tajima's D = (pi - Theta_w) / C), and will provide preliminary data for my fluctuating selection proposal.
The bam and reference sequence files for the admixture project are in node5:/node5raid/assem/lycaeides/lycaeides_gbs/gbs_to_gbsref_5v13/. Thus, I am running these analyses on node5 for the moment, but I will want to collect these results and data and move them to the dorc cluster sometime.
First I need to generate a bcf file for each population (e.g. for BTB),
samtools mpileup -I -g -q 20 -Q 13 -f lycaeides_gbscontigs_reference.fasta *BTB*bam > BTBsites.bcf
Then I generate an initial estimate of the SFS,
bcftools view -cGP cond2 BTBsites.bcf > /dev/null 2> BTBsites.1.afs
And then I iterate the EM algorithm (20 times) to infer the SFS,
bcftools view -cGP BTBsites.N.afs BTBsites.bcf > /dev/null 2> BTBsites.N+1.afs
I am using a perl script (sfsEM.pl) to iterate this process over all 66 populations with fork. The output files include the SFS (unfolded, and split across sets of bases that will need to be combined), heterozygosity (pi or Theta_pi) and Theta_w.
I tried a second approach to estimate Tajima's D (for each GBS contig) with a program called Analysis of Next Generation Sequencing Data (ANGSD) which is described in part in Korneliussen et al. 2013 [doi:10.1186/1471-2105-14-289]. This program does many other things and could be of general use. I have a copy installed on node5. This includes a main program angsd in ~/bin/ and accessory programs in ~/angsd0.570/misc/. Here is what I did for BTB (I am not sure whether this is completely correct as the on-line manual is out of date, also the difference between Theta_w and Theta_pi is in the opposite direction of what I have from samtools):
## estimate sfs
angsd -bam blist -out ./egsfs -doSaf 1 -fold 1 -anc lycaeides_gbscontigs_referrence.fasta -GL 2
~/angsd0.570/misc/emOptim2 egsfs.saf 46 -maxIter 100 -P 16 > egsfs.saf.em.ml
## estimate thetas
angsd -bam blist -out ./egsfs -doThetas 1 -doSaf 1 -fold 1 -pest egsfs.saf.em.ml -anc lycaeides_gbscontigs_reference.fasta -GL 2
## neutrality tests
~/angsd0.570/misc/thetaStat make_bed egsfs.thetas.gz
~/angsd0.570/misc/thetaStat do_stat egsfs.thetas.gz -nChr 92
The main results are in egsfs.thetas.gz.pestPG which is in node5:/node5raid/assem/lycaeides/lycaeides_gbs/gbs_to_gbsref_5v13/tajima. Again, I am not really sure that this is right, but this method and program are worth looking into more in the future. For now I am just going to report genome-wide estimates of pi, theta_w and the sfs from samtools/bcftools for the revisions and grant proposal.