spatial autocorrelation of allele frequencies

Post date: Sep 24, 2013 2:51:17 PM

I want to quantify spatial autocorrelation in allele frequencies and contrast patterns of spatial autocorrelation among common, low frequency, and rare variants. It would also be interesting to contrast patterns of spatial autocorrelation for putatively admixed vs. non-admixed populations, but the spatial scale and number of sampled populations for admixed entities is quite different than for the non-admixed populations, so this might not be very meaningful. I spent a few days trying to figure out the best way to measure spatial autocorrelation. The two main approaches are to calculate the allele frequency correlation (Moran's I) or semivariance (variogram) at different spatial lags (see Wagner et al. 2005; doi:10.1534/genetics.104.036038 for details). The variogram approach allows parametric models to be fit to the pattern of spatial autocorrelation, but I don't think this is relevant for what I want to do. And Moran's I has a bit more intuitive interpretation. So, I decided to use Moran's I. There is a bit of confusion how to combine information across loci for an 'overall' Moran's I. I decided to follow Eckert et al (2010; doi: 10.3732/ajb.0900099) and simply calculate the mean Moran's I over loci. I then test for the significance of the correlation at each lag by permuting the population locations. I wrote my own R function for this, as available functions either do not combine information for multiple characters or scale the data in a way that I don't think is appropriate when doing so.

I have run some initial analyses of spatial autocorrelation while playing with bin sizes and different groups of populations. I think we really only have sufficient sampling (spatial distribution and number of populations) to quantify spatial autocorrelation for all populations, melissa, and idas (all idas, including longinus), but I want to look at this a bit more. In general I am pretty happy with 75 km bins, but I am trying an analysis with the first three bins being 25 km for all populations. This is still running. But the current results indicate that (at least for all, idas, and melissa) there is positive spatial autocorrelation at short distances, negative spatial autocorrelation at high distances (consistent with clinal structure) and higher correlations for common than low frequency than rare variants (moran.pdf). Here is the R code I used (this is called moranAnalysis.R, but I will likely edit it more for the final figures):

library(fields)

nind<-1521

nloc<-15069 ## even though some data sets have 15076

source("correlog.R") ## read function for Moran's I correlogram with significance test

## read data

gcommon<-matrix(scan("../admixprops/results/mngen_commonk2-8.txt",n=nind*nloc,sep=" "),nrow=nloc,ncol=nind,byrow=T)

glowf<-matrix(scan("../admixprops/results/mngen_lowfk2-8.txt",n=nind*nloc,sep=" "),nrow=nloc,ncol=nind,byrow=T)

grare<-matrix(scan("../admixprops/results/mngen_rarek2-8.txt",n=nind*nloc,sep=" "),nrow=nloc,ncol=nind,byrow=T)

pops<-read.table("pops.txt",header=FALSE)

npops<-length(unique(pops[,1]))

## calculate afs

afcommon<-matrix(NA,nrow=nloc,ncol=npops)

aflowf<-afcommon

afrare<-afcommon

for(i in 1:nloc){

afcommon[i,]<-tapply(X=gcommon[i,],INDEX=pops[,1],mean)/2

aflowf[i,]<-tapply(X=glowf[i,],INDEX=pops[,1],mean)/2

afrare[i,]<-tapply(X=grare[i,],INDEX=pops[,1],mean)/2

}

## read coordinates

poploc<-read.table("admixpoplocs.txt",header=F)

dmatrix<-rdist.earth(poploc[,2:3],miles=FALSE) ## cacluate the Great circle (geographic) distance, all pairs

## Correlogram, Moran's I all populations

## read groups

grps<-read.table("groups.csv",sep=",",header=T)

A<-vector(mode='list',length=7)

gnams<-c("all","mel","idas","anna","alp","idasN","long")

names(A)<-gnams

## all

A$all<-grps$idas > -1

## melissa

A$mel<-grps$mel == 1

## idas

A$idas<-grps$idas == 1

## anna

A$anna<-grps$anna == 1

## alp

A$alp<-grps$alp == 1

## idasN

A$idasN<-grps$idasN == 1

## long

A$long<-grps$long == 1

## chose bins

bins<-vector(mode='list',length=7)

numbins<-c(20,14,14,5,5,3,5)

for (n in 1:7){

a<-dmatrix[A[[n]],A[[n]]][upper.tri(dmatrix[A[[n]],A[[n]]])]

bins[[n]]<-seq(0,ceiling(max(a)),75)

#bins[[n]]<-seq(floor(min(a)),ceiling(max(a)),(ceiling(max(a))-floor(min(a)))/numbins[n])

#bins[[n]]<-quantile(a,probs=seq(0,1,1/numbins[n]))

}

moranCommon<-vector(mode='list',length=7)

names(moranCommon)<-gnams

moranLowf<-vector(mode='list',length=7)

names(moranLowf)<-gnams

moranRare<-vector(mode='list',length=7)

names(moranRare)<-gnams

## loop through subsets

nperm<-1000

for (n in 1:7){

moranCommon[[n]]<-moran(latlong=poploc[A[[n]],2:3],afcommon[,A[[n]]],bins[[n]],np=sum(A[[n]]),perm=nperm)

moranLowf[[n]]<-moran(latlong=poploc[A[[n]],2:3],aflowf[,A[[n]]],bins[[n]],np=sum(A[[n]]),perm=nperm)

moranRare[[n]]<-moran(latlong=poploc[A[[n]],2:3],afrare[,A[[n]]],bins[[n]],np=sum(A[[n]]),perm=nperm)

}

nperm<-1000

moranCommonB2<-moran(latlong=poploc[,2:3],afcommon,c(0,25,50,seq(75,1425,75)),np=66,perm=nperm)

moranLowfB2<-moran(latlong=poploc[,2:3],aflowf,c(0,25,50,seq(75,1425,75)),np=66,perm=nperm)

moranRareB2<-moran(latlong=poploc[,2:3],afrare,c(0,25,50,seq(75,1425,75)),np=66,perm=nperm)

pdf("moran.pdf",width=8,height=11)

par(mfrow=c(3,1))

par(mar=c(4.5,4,2,1))

for(n in 1:length(A)){

plot(moranCommon[[n]]$binmns,moranCommon[[n]]$correlation,type='b',xlab="distance",ylab="correlation")

x<-which(moranCommon[[n]]$p < 0.025 | moranCommon[[n]]$p > 0.975)

points(moranCommon[[n]]$binmns[x],moranCommon[[n]]$correlation[x],pch=19)

points(moranLowf[[n]]$binmns,moranLowf[[n]]$correlation,type='b',col="darkgray")

x<-which(moranLowf[[n]]$p < 0.025 | moranLowf[[n]]$p > 0.975)

points(moranLowf[[n]]$binmns[x],moranLowf[[n]]$correlation[x],pch=19,col="darkgray")

points(moranRare[[n]]$binmns,moranRare[[n]]$correlation,type='b',lty=2)

x<-which(moranRare[[n]]$p < 0.025 | moranRare[[n]]$p > 0.975)

points(moranRare[[n]]$binmns[x],moranRare[[n]]$correlation[x],pch=19)

abline(h=0,lty=3)

title(main=gnams[n])

}

dev.off()