Post date: Mar 27, 2017 3:34:50 AM
I re-ran the LDA analyses to test for recent migrants, that is, individuals that appear to have come from a different population that the population from which they were sampled. The main difference from what Sam did earlier is that I used genotype estimates from entropy rather than from population-specific models. This means I used fewer SNPs (because I ran entropy on common SNPs) and that inferences accounted for uncertainty in which group/population/cluster an individual belonged too. The latter is more conservative than what we did before. Also, I considered all pairs of populations, but focused on the details for those you did.
Here are the key results:
Across all 300 population pairs (all pairs for the 25, which excludes ABM) the mean assignment probability to the correct population (whichever one that was) was 0.9918 (most individuals were confidently assigned to the population from which they were sampled). ldaSortedAssign.pdf (attached) shows the mean assignment probability to the correct/collected population across all 300 pairs sorted from lowest to highest. Similarly, ldaAssignDist.pdf shows this as a function of log geographic distance with whether same or different host populations are considered. Note that average assignments to the collected populations were very similar for same (0.964, sd = 0.0524) and different (0.984, sd = 0.953) host comparisons.
Very few individuals were confidently (assignment prob. > 0.9) assigned to the alternative population (the one they were not sampled from): 0 in 221 population pairs, 1 in 66 pairs, and 2 in 13 pairs (we never had more than two individual assigned to the population they were not collected from). ldaAssigns.pdf shows individual assignment probabilities for four population pairs on different hosts that were nearby (the ones you used before). Two of them (a and c) show all individuals confidently assigned to the population they were collected from. In one (b), there is much more uncertainty in general (the populations must be quite similar genetically), but two likely migrants. And in the last one (d), which is perhaps the most interesting, we have a single individual that is most likely a migrant from SUV (or a similar population) to SLA.
The results and R code (migLda.R) are in /uufs/chpc.utah.edu/common/home/u6000989/projects/lmelissa_hostAdaptation/demog/.
Here is the R code:
N<-525
L<-14051
g<-matrix(scan("mngen_k2to5.txt",n=N*L,sep=" "),nrow=L,ncol=N,byrow=TRUE)
Gm<-matrix(NA,nrow=525,ncol=525)
for(i in 1:525){
for(j in i:525){
Gm[i,j]<-cov(g[,i],g[,j])
Gm[j,i]<-Gm[i,j]
}
}
pc<-prcomp(Gm,center=TRUE,scale=FALSE)
#Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7
#Standard deviation 0.7831 0.21021 0.11202 0.06006 0.04776 0.04338 0.03086
#Proportion of Variance 0.8620 0.06212 0.01764 0.00507 0.00321 0.00265 0.00134
#Cumulative Proportion 0.8620 0.92417 0.94181 0.94688 0.95008 0.95273 0.95407
# DA on all pairs
ids<-read.table("melIds.txt",header=FALSE)
pops<-as.character(unique(ids[,1]))
ids<-as.character(ids[,1])
pops<-pops[-2] ## drop ABM
dat<-read.table("hosts-geo.txt",header=T)
host<-as.numeric(dat$host=='medicago')
library(fields)
library(MASS)
z<-0
Z<-(25*24)/2
ldapops<-cbind(rep(NA,Z),rep(NA,Z))
ldares<-matrix(NA,nrow=Z,ncol=3)
ldadat<-matrix(NA,nrow=Z,ncol=2)
for(i in 1:24){
for(j in (i+1):25){
z<-z+1
ldapops[z,1:2]<-c(pops[i],pops[j])
p1<-which(ids==pops[i])
p2<-which(ids==pops[j])
x<-pc$x[c(p1,p2),1:4]
gr<-c(rep(1,length(p1)),rep(2,length(p2)))
o<-lda(x=x,grouping=gr,CV=TRUE)
mncorrect<-mean(c(o$posterior[gr==1,1],o$posterior[gr==2,2]),na.rm=TRUE)
nmiss<-sum(o$posterio[gr==1,1] < 0.1,na.rm=TRUE)+sum(o$posterio[gr==2,2] < 0.1,na.rm=TRUE)
ntot<-sum(is.na(o$posterior[,1])==FALSE)
ldares[z,1:3]<-c(mncorrect,nmiss,ntot)
ldadat[z,1]<-host[i]+host[j]
d<-rdist.earth(as.matrix(dat[c(i,j),c(4,3)]),miles=FALSE)
ldadat[z,2]<-d[1,2]
}
}
mean(ldares[,2]/ldares[,3])
#0.008154622
median(ldares[,2]/ldares[,3])
#0
#### plots ####
## mean prob. of assignment to the correct population by distance and sorted
cs<-ldadat[,1]
cs[cs==0]<-2
cs<-c("orange","darkgray")[cs]
pdf("ldaAssignDist.pdf",width=5.5,height=5.5)
par(mar=c(5.5,5.5,0.5,0.5))
plot(log(ldadat[,2]),ldares[,1],pch=19,xlab="Log distance (km)",ylab="Mean assignment prob.",cex.lab=1.5,cex.axis=1.2,col=cs)
legend(6,0.56,c("same host","different host"),pch=19,col=c("darkgray","orange"))
dev.off()
pdf("ldaSortedAssign.pdf",width=5.5,height=5.5)
par(mar=c(5.5,5.5,0.5,0.5))
plot(1:300/300,sort(ldares[,1]),pch=19,xlab="Quantile",ylab="Mean assignment prob.",cex.lab=1.5,cex.axis=1.2)
dev.off()
# DA on pairs of pops using first 3 pcs
# Sam did this with the folowing pairs
# vcp x sla
# wal x gvl
# wal x vcp
# svc x vcp
# svc x sla
pps<-matrix(c("VCP","SLA","GVL","WAL","VCP","WAL","SUV","SLA"),nrow=4,ncol=2,byrow=TRUE)
tit<-c("(a) VCP-SLA","(b) GVL-WAL","(c) VCP-WAL","(d) SUV-SLA")
pdf("ldaAssigns.pdf",width=7,height=11)
par(mfrow=c(4,1))
par(mar=c(4,5,1.5,0.5))
for(a in 1:4){
p1<-which(ids==pps[a,1])
p2<-which(ids==pps[a,2])
x<-pc$x[c(p1,p2),1:4]
gr<-c(rep(1,length(p1)),rep(2,length(p2)))
o<-lda(x=x,grouping=gr,CV=TRUE)
barplot(t(o$posterior),names.arg=gr,cex.names=0.8,horiz=FALSE,beside=FALSE,ylab="Assignment probability",cex.lab=1.5,cex.axis=1.2)
title(main=tit[a],adj=0,cex.main=1.2)
if(a==4){mtext("Population",1,line=2.8)}
}
dev.off()
save(list=ls(),file="lmelLda.rdat")