Post date: Oct 03, 2014 7:25:53 PM
Population-specific genotype likelihood files are in /labs/evolution/projects/timema_radiation/. I used the addHdr* scripts in the same directory to add the number of individuals and loci to the file.
I wrote a new C++ program that uses a soft EM algorithm to obtain ML allele frequency estimates. The program is call estpEM. Here is a summary from the README fils:
#######################################################
Program to obtain ML allele frequency estimates from genotype likelihoods using a soft EM algorithm, see equation 5 in doi:10.1093/bioinformatics/btr509 for mathematical details.
GSL is required. To compile try,
g++ -O3 -o estpEM main.C func.C -lm -lgsl -lgslcblas
Type 'estpEM' for command line options.
Provide the infile in genotype likelihood (gl) format. A header row with the number of individuals and loci is required, e.g.,
50 10000
Additional header rows are allowed but must be indicated with the -h option. After the header row(s), each row contains the phred-scaled genotpye likelihoods for one locus, which are preceded by a locus id. There should be three genotype likelihoods per individual: RR, RA, AA, where R and A are the reference and alternate allele respectively.
A single output file is generated. This has one row per locus with the locus id, an initial allele frequency estimate, and the EM ML estimate. The initial allele frequency estimate is 1/2n (Sum_n Sum_g g * L(g)).
########################################################
I used this program with the following options to obtain allele frequency estimates for each population (these are non-reference allele frequency).
cd /labs/evolution/projects/timema_radiation/popgen/
estpEM -i ../variants/timemaBMCG3_Q.gl -o p_timemaBMCG3_Q.txt -e 0.0001 -m 20 -h 0
The allele frequencies are in the p_*txt files in /labs/evolution/projects/timema_radiation/popgen/.
I used a series of R scripts (one for each of 10 population comparisons) to estimate Hudson's Fst for each locus (see /labs/evolution/projects/timema_radiation/popgen/scripts/fst*R). Figures showing the distribution of Fst across the genome are in /labs/evolution/projects/timema_radiation/popgen/results/*png and R workspaces for each population are in the same directory. Here is a summary of the Fst distribution (from fsts.txt):
pop fsthat min q1st med mean q3rd max
BCBOG_CxBCBOG_Q 0.0807 0 0 0 0.0111 8e-04 0.9055
BCTUR_CxBCTUR_P 0.2244 0 0 1e-04 0.0761 0.1407 0.9995
BMCG3_ICxBMCG3_Q 0.064 0 0 0 0.0134 3e-04 0.9458
BMT_CxBMT_Q 0.0646 0 0 0 0.013 4e-04 0.9844
BS_CxBS_Q 0.0629 0 0 0 0.0069 7e-04 0.9374
CR_CxCR_CY 0.0755 0 0 0 0.0062 2e-04 0.9996
HFTP_CxHFRS_Q 0.2539 0 0 1e-04 0.0738 0.1338 0.9999
LP_DFxLP_Q 0.8434 0 0 0 0.1493 0.0014 1
SM_QxSM_RW 0.8288 0 0 1e-04 0.1487 0.0077 1
VP_CxVP_Q 0.0644 0 0 0 0.0054 2e-04 0.7558