Post date: Mar 11, 2017 4:07:33 AM
I finished a first pass at estimating selection in FHA (2011 to 2013), which I think I am mostly happy with. The files and results are in king:/uufs/chpc.utah.edu/common/home/gompert-group1/projecttimema_fluct/genomic_change_dark_morph/sel_fha/. I toyed with different ways to do this, but ended up going with an ABC approach, as this made it easy to deal with the fact that two generations separate the samples and that hets. could be more or less fit than homozygotes (I could program a Metropolis-Hastings algorithm for this instead, but the two-generation separation would make the MCMC pretty inefficient, and the ABC was quite zippy). This also helped deal with the het excess, which makes HWE assumptions hard to justify.
The basic model is that we have relative fitnesses for m/m, g/m and g/g given by 1 - s (m/m), 1 (g/m) and 1 - t (g/g). This is nice and flexible: heterozygote advantage occurs when s and t are both > 0, underdominance when s and t are both < 0, and incomplete dominance when one (s or t) is > 0 and the other is < 0.
I assumed viability selection and that selection had already occurred in 2011. So, we have random mating to generate 2012 genotype frequencies, then selection on genotypes given the model above. This is followed by random mating again to get the 2013 initial allele frequencies, and then selection again on genotypes to give the genotype frequencies we have in 2013.
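To make the life cycle concrete, here is a minimal deterministic sketch of one round of random mating plus viability selection (the s and t values are arbitrary placeholders, not estimates; the full stochastic version with drift is onesim() in the code at the end):
## one deterministic generation: random mating, then viability selection
s<-0.2; t<-0.2                    ## placeholder values, not estimates
W<-c(1-s,1,1-t)                   ## relative fitnesses of m/m, m/g, g/g
Gfreq<-c(0.042,0.550,0.408)       ## observed 2011 genotype frequencies
p<-Gfreq[1] + 0.5 * Gfreq[2]      ## frequency of the mel (dark) allele
Gfreq<-c(p^2,2*p*(1-p),(1-p)^2)   ## genotype frequencies after random mating
Gfreq<-Gfreq*W/sum(Gfreq*W)       ## expected post-selection frequencies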
I placed independent uniform priors on s and t bounded by -0.5 and 0.5, and ran a million simulations to estimate s and t with ABC.
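For intuition, the plain rejection version of this is: draw (s, t) from the prior, push each draw through the two-generation simulation, and keep the draws whose simulated 2013 genotype frequencies land closest to the observed ones. A minimal sketch of that idea (the actual analysis below uses the abc package with ridge-regression adjustment instead; onesim(), Gfr0 and Gfr1 are defined in the code at the end, and Ne is set to the ~110 mentioned below):
## rejection-ABC sketch; onesim(), Gfr0 and Gfr1 come from the code below
nsims<-1000000
prior<-matrix(runif(2*nsims,min=-0.5,max=0.5),ncol=2)   ## draws of (s, t)
sims<-t(sapply(1:nsims,function(x)
    onesim(W=c(1-prior[x,1],1,1-prior[x,2]),Gfreq=Gfr0,Ne=110)))
d<-sqrt(rowSums(sweep(sims,2,Gfr1)^2))    ## distance to observed 2013 freqs
post<-prior[d<=quantile(d,0.005),]        ## keep the closest 0.5% (tol=0.005)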
Here is what I found:
1. First, some raw results. In 2011 the frequency of the dark allele/haplotype was 0.317. In 2013 it was 0.357, consistent with the direction of phenotypic change. Thus, the dark allele increased in frequency. Genotype frequencies in 2011 were m/m = 0.042, m/g = 0.550 and g/g = 0.408. In 2013 they were m/m = 0.0897, m/g = 0.540 and g/g = 0.370. In both years we have an excess of hets (HW expectations for m/g were 0.433 and 0.461 in 2011 and 2013, respectively). The excess is not trivial. So, in a general sense, the excess of hets within years suggests het. advantage, and the change in dark allele frequency between years could be explained by directional selection or by het. advantage, assuming the equilibrium frequency of the dark allele is greater than the current frequency. Drift could also explain/contribute to the patterns (Ne is ~110).
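As a quick check, the HW expectations quoted above follow directly from the observed genotype frequencies:
## verify the het excess from the observed genotype frequencies
G11<-c(0.042,0.550,0.408)      ## 2011: m/m, m/g, g/g
G13<-c(0.0897,0.540,0.370)     ## 2013: m/m, m/g, g/g
q11<-G11[1] + 0.5 * G11[2]     ## f(mel) in 2011 = 0.317
q13<-G13[1] + 0.5 * G13[2]     ## f(mel) in 2013, ~0.36
2*q11*(1-q11)                  ## expected hets 2011 = 0.433, observed 0.550
2*q13*(1-q13)                  ## expected hets 2013 = 0.461, observed 0.540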
2. Selection coefficient estimates are consistent with (1), but there is some uncertainty.
s: median = 0.24, mode = 0.34, 90% HPDI = (-0.17, 0.46), 95% HPDI = (-0.17, 0.50)
t: median = 0.25, mode = 0.27, 90% HPDI = (0.014, 0.42), 95% HPDI = (-0.015, 0.46)
We are pretty confident that g/g has lower fitness than m/g (see the summary of the posterior for t from the previous e-mail; the pp for t > 0 is 0.956). We are less confident about m/m (see the posterior for s from the previous e-mail; the pp for s > 0 is 0.851). Reduced confidence in the latter is mostly due to the fact that m/m just isn't very common (it is hard to infer the fitness of rare genotypes). We could get around this by assuming an additive model (which is what I would usually do), but I don't think we can justify this (or want to do it) given the het excess. Another way to put this that gives a maybe cleaner numerical answer: the pp that g/g is the most fit genotype is 0.026 (thus, the pp that it is not the most fit is 0.974). Likewise, the probability that m/g is the most fit is 0.828, and the probability that m/m is the most fit is 0.146. So, we know g/g sucks relative to g/m, and g/m is probably better than m/m, but we can't be as sure about the latter (we also aren't fully sure about m/m vs. g/g; see the next bit). One last way to look at it: here are pp's for the different fitness orders (a sketch for computing these from the posterior follows the list):
m/m < m/g < g/g = 0.028
m/m < g/g < m/g = 0.441
m/g < g/g < m/m = 0.004
m/g < m/m < g/g = 0.0174
g/g < m/m < m/g = 0.387
g/g < m/g < m/m = 0.128
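These ordering pp's are easy to pull straight from the ABC posterior. A minimal sketch, assuming aout is the abc() output from the code at the end (adj.values column 1 = s, column 2 = t) and using the posterior weights that come with the ridge method:
## posterior probabilities of fitness orderings from the ABC samples
w<-cbind(1-aout$adj.values[,1],1,1-aout$adj.values[,2])   ## fitness of m/m, m/g, g/g
weighted.mean(aout$adj.values[,2] > 0,aout$weights)       ## pp(t > 0), i.e. g/g < m/g
weighted.mean(w[,2] > w[,1] & w[,2] > w[,3],aout$weights) ## pp that m/g is most fit
weighted.mean(w[,3] < w[,1] & w[,1] < w[,2],aout$weights) ## pp of g/g < m/m < m/g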
3. Related to (2), I ran a quick check of how well the ABC procedure is working by analyzing 100 simulated data sets. There is little to no evidence of bias. Point estimates of s and t were clearly correlated with the true parameter values, but the correlation was better for t (r = 0.88) than for s (r = 0.65). This again reflects the fact that we have more information on the fitness of g/g individuals than m/m individuals.
4. Regarding green bugs getting whacked, just to be clear, we are distinguishing green from dark at mel, so "green" includes green and striped. True green bugs should get whacked on Adenostoma, but not the green-striped ones, and most, but not all, of the "green" bugs should be green-striped.
5. Going for all six genotypes could be interesting, with some caveats. Right now I am working with mel alone; we might then want to add stripey. Also, we know the signal in terms of genomic change varies between mel and stripey, so there are downsides to combining them. Without combining them, I am not sure we want to fully think about the six genotypes (though even with just mel, we do see six clusters in the PCA for FHA... see the attached plot, where points are colored based on the k-means cluster assignments I used to define genotypes). Finally, with six genotypes we will have more parameters and fewer individuals per genotype (and Ne is modest and we essentially have a single replicate). We could fit a model for all six pretty easily if we assume additivity, but this isn't what we want. I suspect that trying to infer a separate fitness value (selection coefficient) for all six genotypes is going to be really pushing the data, such that we would have considerable uncertainty in the parameters. I am not saying we definitely shouldn't go this route, just that we need to think about it carefully.
Here is the R code and some extra notes:
library(abc)   ## for abc()
library(coda)  ## for as.mcmc() and HPDinterval()
N0<-500    ## individuals sampled in 2011
N1<-602    ## individuals sampled in 2013
L<-175918  ## number of SNPs
g0<-matrix(scan("pntest_fha2011.txt",n=N0*L,sep=" "),nrow=L,ncol=N0,byrow=TRUE)
g1<-matrix(scan("pntest_fha2013.txt",n=N1*L,sep=" "),nrow=L,ncol=N1,byrow=TRUE)
ids<-read.table("../popgen_fha/snpids.txt",header=FALSE)
cdat<-read.table("fha2013pheno.txt",header=TRUE)
## define mel and stripy
mlb<-4139489
mub<-6414835
slb<-6414836
mel<-c(which(ids[,2]==702.1 & ids[,3] < mlb),
       which(ids[,2]==128 & ids[,3] < mub))
stripy<-which(ids[,2]==128 & ids[,3] >= slb)
##
g0mel<-g0[mel,]  ## subset to mel-region SNPs
g1mel<-g1[mel,]
## genetic covariance matrix among individuals
M0mel<-matrix(NA,nrow=N0,ncol=N0)
for(i in 1:N0){
    for(j in i:N0){
        M0mel[i,j]<-cov(g0mel[,i],g0mel[,j])
        M0mel[j,i]<-M0mel[i,j]
    }
}
pc0<-prcomp(M0mel,center=TRUE,scale.=FALSE)
pc0x<-prcomp(t(g0mel),center=TRUE,scale.=FALSE)  ## PCA on individuals
pc1x<-prcomp(t(g1mel),center=TRUE,scale.=FALSE)
## k-means on the first two PCs; cluster labels are arbitrary, so check the plots
o0<-kmeans(pc0x$x[,1:2],centers=3) ## 1 = gr/gr, 2 = gr/br, 3 = br/br
plot(pc0x$x[,1],pc0x$x[,2],col=c("green","blue","brown")[o0$cluster])
pcall<-prcomp(rbind(t(g0mel),t(g1mel)),center=TRUE,scale.=FALSE)
oall<-kmeans(pcall$x[,1:2],centers=3) ## 1 = gr/gr, 3 = gr/br, 2 = br/br
plot(pcall$x[,1],pcall$x[,2],col=c("forestgreen","brown","lightgreen")[oall$cluster])
gen0<-oall$cluster[1:N0]     ## 2011 individuals
gen1<-oall$cluster[-c(1:N0)] ## 2013 individuals
## mel allele frequency: br/br homozygotes (cluster 2) plus half the hets (cluster 3)
pmel0<-(sum(gen0==2) + 0.5 * sum(gen0==3))/(length(gen0))
pmel1<-(sum(gen1==2) + 0.5 * sum(gen1==3))/(length(gen1))
## mel in gen 0 (2011)
## q = f(mel) = 0.317
## gen freq: m/m = 0.042, m/g = 0.550, g/g = 0.408
## hw exp: m/m = 0.100, m/g = 0.433, g/g = 0.466
## mel in gen 1 (2013)
## q = f(mel) = 0.357
## gen freq: m/m = 0.0897, m/g = 0.540, g/g = 0.370
## hw exp: m/m = 0.129, m/g = 0.461, g/g = 0.410
## abc analysis
## observed genotype frequencies (m/m, m/g, g/g) in 2011 and 2013
Gfr0<-c(0.042,0.550,0.408)
Gfr1<-c(0.0897,0.540,0.370)
onesim<-function(W,Gfreq,Ne){
    ## generation 1 (2012): random mating, selection, then drift
    p<-Gfreq[1] + 0.5 * Gfreq[2]  ## allele frequency after selection in 2011
    q<-1-p
    Gfreq<-c(p^2,2*p*q,q^2)       ## genotype frequencies after random mating
    wbar<-sum(Gfreq * W)          ## mean fitness
    Gfreq<-Gfreq * W/wbar         ## expected post-selection frequencies
    Gcnts<-rmultinom(n=1,size=Ne,prob=Gfreq)  ## multinomial drift
    Gfreq<-Gcnts/Ne
    ## generation 2 (2013): same cycle again
    p<-Gfreq[1] + 0.5 * Gfreq[2]
    q<-1-p
    Gfreq<-c(p^2,2*p*q,q^2)
    wbar<-sum(Gfreq * W)
    Gfreq<-Gfreq * W/wbar
    Gcnts<-rmultinom(n=1,size=Ne,prob=Gfreq)
    Gfreq<-Gcnts/Ne
    return(t(Gfreq))
}
abcloop<-function(N=1000,Ne=200,Gfreq=Gfr0,lb=-0.5,ub=0.5){
    simfr<-matrix(NA,nrow=N,ncol=3)      ## simulated 2013 genotype frequencies
    st<-matrix(NA,nrow=N,ncol=2)         ## prior draws of s and t
    for(x in 1:N){
        st[x,]<-runif(n=2,min=lb,max=ub) ## uniform(-0.5, 0.5) priors
        W<-c(1-st[x,1],1,1-st[x,2])      ## fitness of m/m, m/g, g/g
        simfr[x,]<-onesim(W=W,Gfreq=Gfreq,Ne=Ne)
    }
    out<-list(st,simfr)
    return(out)
}
## run the simulations: a million prior draws; Ne set to ~110 per the text
oo<-abcloop(N=1000000,Ne=110,Gfreq=Gfr0)
## ridge-regression ABC, tol = 0.005 (keeps 5000 of the million draws)
aout<-abc(target=as.vector(Gfr1),param=as.matrix(oo[[1]]),sumstat=as.matrix(oo[[2]]),tol=0.005,method="ridge")
> summary(aout)
#Call:
#abc(target = as.vector(Gfr1), param = as.matrix(oo[[1]]), sumstat = as.matrix(oo[[2]]),
#    tol = 0.005, method = "ridge")
#Data:
# abc.out$adj.values (5000 posterior samples)
#Weights:
# abc.out$weights
#
## (P1 = s, P2 = t)
#                            P1      P2
#Min.:                  -0.6025 -0.3328
#Weighted 2.5 % Perc.:  -0.2674 -0.0333
#Weighted Median:        0.2409  0.2509
#Weighted Mean:          0.2104  0.2372
#Weighted Mode:          0.3447  0.2701
#Weighted 97.5 % Perc.:  0.4814  0.4426
#Max.:                   0.5047  0.4997
## 95% HPD intervals (adj.values column 2 = t, column 1 = s)
HPDinterval(obj=as.mcmc(aout$adj.values[,2]),prob=0.95)
#           lower     upper
#var1 -0.01521676 0.4591293
HPDinterval(obj=as.mcmc(aout$adj.values[,1]),prob=0.95)
#          lower     upper
#var1 -0.1738543 0.5047007
## check quality: re-estimate s and t for 100 of the simulated data sets
est<-matrix(NA,nrow=100,ncol=2)
for(i in 1:100){
    xx<-abc(target=as.vector(oo[[2]][i,]),param=as.matrix(oo[[1]]),sumstat=as.matrix(oo[[2]]),tol=0.005,method="ridge")
    est[i,1]<-median(xx$adj.values[,1])
    est[i,2]<-median(xx$adj.values[,2])
}
cor.test(oo[[1]][1:100,1],est[,1])
#	Pearson's product-moment correlation
#
#data:  oo[[1]][1:100, 1] and est[, 1]
#t = 8.4974, df = 98, p-value = 2.178e-13
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.5216452 0.7515928
#sample estimates:
#      cor
#0.6513279