hidden markov model to identify regions of high genetic differentiation

Post date: Nov 04, 2013 4:57:22 PM

I am using a discrete state, homogeneous hidden markov model (HMM) to identify genetic regions with normal and exceptional genetic differentiation between Timema populations pairs. The HMM that I am using is implemented in the R package HiddenMarkov. The limitation of using this implementation is that I am constrained to assume the transition rate is constant between SNVs different distances away from each other. This is not perfect, but without information on recombination rates this might be the best we can do. Also there is precedent for this approach as it was proposed and used by Excoffier's group (Hofer et al. BMC Genomics 2012, 13:107).

The model and package have two key algorithms. The Baum-Welch algorithm is a variant of the EM algorithm that estimated the transition matrix and parameters of the emitting distribution from the data given the observations and a specified number of states. It might be possible to place constraints on the optimization procedure (Hofer et al. mention only allowing transitions between certain states, but I haven't figured this out yet). The Viterbi algorithm takes the model parameters as given and obtains the ML path of hidden states (called global decoding). In this conext this the the most likely assignment of SNVs to normal or high differentiation regions.

I initially tried estimating parameters from the data with the Baum-Welch algorithm. This gives high transition rates and does not yield a peaked structure. I want to look into this more, but I have mainly focused on different pre-defined parameters that would possibly uncover an island-like structure if it exists. In particular I have estimated hidden states with various lower transition probabilities (0.00001-0.0000001 for moving out of the background state and0.005 to 0.000005 for moving out of the high differentiation state). I always used the observed mean and sd for logit(Fst) for the background state and I have used means increased by 10 or 25% or set to the 3rd quartile for the alternative mean and sds increased by 10 or 25%. Some results are consistent across conditions. For example, we tend to see the most divergence beaks for LA x PRC and R12A x R12C. But the number of peaks fluctuates under different conditions. This shouldn't be surprising as the number of differentiation peaks should depend on how they are defined. So far the lowest transition rates and 10% mean and sd increase provide some of the largest and most island-like divergence peaks. I will try going farther in this direction to determine whether a set of parameters exist that would classify the 'putative inversion' (or whatever this actually is) between R12A and R12C as a single divergence peak.

My results are projects/timema_wgwild/hmm/ and the R code so far is below:

library(HiddenMarkov)

nl<-4391556

np<-4

Q<-read.table("parallelismInformation.txt",header=T,sep=" ")

QFst<-matrix(scan("fstQuantiles.txt",n=nl*np,sep=" "),nrow=nl,ncol=np,byrow=TRUE)

fst<-matrix(scan("fstEst.txt",n=nl*np,sep=" "),nrow=nl,ncol=np,byrow=TRUE)

## logit transform fst

lfst<-log(fst/(1-fst))

lfst[is.na(lfst)==TRUE]<-log(0.0001/0.9999)

lfst[is.infinite(lfst)==TRUE]<-log(0.0001/0.9999)

## set two transistion matrixes

Pi2<-matrix(c(0.999999,0.000001,0.0005,0.9995),byrow=TRUE,nrow=2)

Pi1<-matrix(c(0.9999999,0.0000001,0.000005,0.999995),byrow=TRUE,nrow=2)

Pi<-list(pi1=Pi1,pi2=Pi2)

delta<-c(1,0)

## state 1 mns and sds

mns<-apply(lfst,2,mean)

sds<-apply(lfst,2,sd)

## state 2 mns sds

mnsQ3<-apply(lfst,2,quantile,probs=0.75)

mns10pct<-mns - 0.1 * mns

mns25pct<-mns - 0.25 * mns

sds10pct<-sds*1.1

sds25pct<-sds*1.25

ismns<-list(mnsQ3,mns10pct,mns25pct)

issds<-list(sds10pct,sds25pct)

nis<-array(dim=c(4,2,3,2))

fits<-array(dim=c(4,2,3,2,nl))

mnNames<-c("Q3","plus10pct","plus25pct")

sdNames<-c("plus10pct","plus25pct")

## loop through model combos

for (pp in 1:4){## population pairs

for (tm in 1:2){ ## transition matrix

for (ims in 1:3){ ## island means

for (isd in 1:2){ ## island sds

init<-dthmm(lfst[,pp],Pi[[tm]],delta,"norm",list(mean=c(mns[pp],ismns[[ims]][pp]),sd=c(sds[pp],issds[[isd]][pp])))

fitMl<-Viterbi(init)

ni<-length(isbnds)/2 ## num. islands

nis[pp,tm,ims,isd]<-ni

fits[pp,tm,ims,isd,]<-fitMl

}

for (pp in 1:4){## population pairs

for (tm in 1:2){ ## transition matrix

for (ims in 1:3){ ## island means

for (isd in 1:2){ ## island sds

isbnds<-which(abs(fits[pp,tm,ims,isd,1:(nl-1)]-fits[pp,tm,ims,isd,2:nl])==1)

ni<-length(isbnds)/2 ## num. islands

nis[pp,tm,ims,isd]<-ni

}

locinfo<-read.table("../fst/locids.txt",header=F)

bnds<-which(as.character(locinfo[2:nl,1]) != as.character(locinfo[1:(nl-1),1]))

for (tm in 1:2){ ## transition matrix

for (ims in 1:3){ ## island means

for (isd in 1:2){ ## island sds

if (tm == 1){

mytitle<-paste("figHmmTmLowerIsMn",mnNames[ims],"IsSd",sdNames[isd],".png",sep="")

}

else{

mytitle<-paste("figHmmTmLowIsMn",mnNames[ims],"IsSd",sdNames[isd],".png",sep="")

}

png(mytitle,width=36,height=12,units="in",res=480)

par(mfrow=c(4,1))

par(mar=c(4.5,4.5,0.5,0.5))

for (pp in 1:4){## population pairs

isbnds<-which(abs(fits[pp,tm,ims,isd,1:(nl-1)]-fits[pp,tm,ims,isd,2:nl])==1)

ni<-nis[pp,tm,ims,isd]

plot(fst[,pp],type='n',xlab="locus",ylab=expression(paste("genetic differentiation (",F[ST],")")),cex.lab=1.4,cex.axis=1.1,ylim=c(0,0.8))

for(j in seq(1,((ni*2)-1),2)){

polygon(isbnds[c(j,j+1,j+1,j,j)],c(0,0,1,1,0),density=50,col="green",border=NA)

}

points(fst[,pp],col=c("black","blue")[fits[pp,tm,ims,isd,]])

abline(v=bnds,lty=2,lwd=2,col="gray")

text(50000,0.75,paste("no. diff. peaks = ",ni))

}

dev.off()

}

#isbnds<-which(abs(fitMl4[1:(nl-1)]-fitMl4[2:nl])==1)

#ni<-length(isbnds)

#

#plot(fst[,4],type='n',xlab="locus",ylab=expression(paste("genetic differentiation (",F[ST],")")),cex.lab=1.5,cex.axis=1.3)

#for(j in seq(1,(ni-1),2)){

# polygon(isbnds[c(j,j+1,j+1,j,j)],c(0,0,1,1,0),density=50,col="green",border=NA)

#}

#points(fst[,4],col=c("black","blue")[fitMl4])

#dev.off()