I am doing Fst analysis for the manuscript as suggested by the reviewers. There are various Fst estimators. I will use the Hudson's Fst estimator. Definition from Bhatia et al 2013 (https://www.ncbi.nlm.nih.gov/pubmed/23861382). Hudson (Hudson et al. 1992) defined F ST in terms of heterozygosity. The fundamental difference between these estimators is that for Hudson, the total variance is based upon the ancestral population and not the current sample. The equation for this estimator is Fst = 1 - (Hw/Hb) where Hw is the mean number of differences within populations, and Hb is the mean number of differences between populations. Refer to equation 10 in the paper for the actual formula using allele frequencies and sample size. Remember for diploid organism, sample size = 2(number of samples genotypes) since they have 2 copies of the chromosome. The information I need for this analysis is allele frequencies (p) for each SNP for the populations I need to do the pairwise comparison for. I also need the sample size (n) for each SNP for each population. We did this analysis for the following pairwise comparisons : 1). GLA and SLA 2).GLA and nearby native 3). SLA and nearby medicago. So I created an input file containing columns: scaffold, position, Pgla, Psla, Pnative.
Before getting into the real analysis, I created a final file which could be used for the analysis. The R code for this is saved in the same folder as prepareinputfile.R. here are the list of files in the folder:
melissaAll_allelefreqs.txt = file with allele frequencies of all populations
hudsonsFst.R = R code for the analysis
fst_gla_abc.txt, fst_gla_sla.txt, fst_sla_vcp.txt = fst files with scaffold position and numerator and denominator of hudson's equation and Fst for the population pairs
Outcombine_surv_final.csv, Outcombine_wgt_final.csv = Survival experiment data files from experimental comparison folder
fst_survivaldata_combine.py, fst_weightdata_combine.py = python script to combine the fst and survival and weight data files
Outcombine_surv_fst_glaabc.csv, Outcombine_surv_fst_glasla.csv, Outcombine_surv_fst_slavcp.csv = Output files of the survival pythons script above
Outcombine_wgt_fst_glaabc.csv, Outcombine_wgt_fst_glasla.csv, Outcombine_wgt_fst_slavcp.csv = Output files of the weight pythons script above
prepareinputfile.R = R code to prepare a final input file for the analysis
weight_fst.txt, survival_fst.txt = weight and survival files with fst for each pop pair. THESE ARE THE FILES USED IN THE ANALYSIS
The folder for this analysis is /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/fstCalculations
#### CODE ##############
##read in the allele frequency file for all populations
allfreq <- read.table("melissaAll_allelefreqs.txt", header=T)
##create input file of analysis
fst.in <- cbind(allfreq$scaffold, allfreq$position, allfreq$GLA, allfreq$SLA, allfreq$ABC, allfreq$VCP)
colnames(fst.in) <- c("scaffold","position","Pgla","Psla","Pabc","Pvcp")
##calculate sample size (N) for each population
n.gla <- rep(400, 206028) #20 individuals, 2*20
n.sla <- rep(324, 206028) #18 individuals
n.abc <- rep(361, 206028) #19 individuals
n.vcp <- rep(400, 206028) #20 individuals
##combining the sample sizes to the input matrix for fst above
fst.in <- cbind(fst.in, n.gla, n.sla, n.abc, n.vcp)
#change from atomic vector to data frame to subset columns
fst.in <- data.frame(fst.in)
## function to calculate hudson Fst
hudsonFst<-function(p1=NA,p2=NA,n1=NA,n2=NA){
numerator<-(p1 - p2)^2 - ((p1 * (1 - p1))/(n1 - 1)) - ((p2 * (1 - p2))/(n2 - 1))
denominator<-p1 * (1 - p2) + p2 * (1 - p1)
fst<-numerator/denominator
out<-cbind(numerator,denominator,fst)
return(out)
}
##compare gla and sla
gla.sla <- hudsonFst(p1=fst.in$Pgla, p2=fst.in$Psla, n1 =fst.in$Ngla, n2=fst.in$Nsla)
##compare gla and abc(native)
gla.abc <- hudsonFst(p1=fst.in$Pgla, p2=fst.in$Pabc, n1 =fst.in$Ngla, n2=fst.in$Nabc)
##compare sla and vcp(medicago)
sla.vcp <- hudsonFst(p1=fst.in$Psla, p2=fst.in$Pvcp, n1 =fst.in$Nsla, n2=fst.in$Nvcp)
#convert to data frame
gla.sla <- data.frame(gla.sla)
gla.abc <- data.frame(gla.abc)
sla.vcp <- data.frame(sla.vcp)
##combine the data frame above with scaffold position
gla.sla.pos <- cbind(fst.in$scaffold, fst.in$position, gla.sla)
gla.sla.pos <- data.frame(gla.sla.pos)
gla.abc.pos <- cbind(fst.in$scaffold, fst.in$position, gla.abc)
gla.abc.pos <- data.frame(gla.abc.pos)
sla.vcp.pos <- cbind(fst.in$scaffold, fst.in$position, sla.vcp)
sla.vcp.pos <- data.frame(sla.vcp.pos)