Post date: Feb 22, 2018 12:4:22 AM
We wanted to formally define the insertion-deletion polymorphism that is evident from the coverage plots. I did this (am doing really, need a few more things) by fitting HMMs to the coverage data. Here is what I did so far.
1. Copied files with depth data from Romain to /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_genarch/indel/. Here are his notes:
Here it is: /stash/shared/romain/depth4Zach
I only did compute this for scaffold 128 only. There is a line for each nucleotide of the scaffold, I could do it genome wide but this may take lots of space and time to generate, and am not sure it is necessary in our case ? What do you think ?
You'll find a folder for each species. Within each folder the results are sorted by morph or genotypes (subfolders). Within this subfolder you'll find graphs and a txt files with coverage infos.
I computed results on an individual level (i did not remember I did this) and the two last columns are number of reads accross individuals and the number of individuals that do have reads.
I did not do this with results from a pca from bartmani yet. For cristinae I did the cluster selection 'by eye'. Would it be worth having a stronger method for genotypes selection there ? Once we decide I can run this for bartmani.
2. Grab columns with coverage data from each file.
perl getCoverageData.pl T*/*/*txt
3. Fig the HMM. This is done in R with runHMM.R
## fit HMMs for coverage data
load("coverage.rdat")
library(HiddenMarkov)
N<-dim(Cmat)[2]
## standardize coverage matrix
CmatS<-Cmat
for(i in 1:N){
xna<-which(is.na(Cmat[,i])==TRUE)
CmatS[xna,i]<-0
mn<-mean(CmatS[,i])
s<-sd(CmatS[,i])
CmatS[,i]<-(CmatS[,i]-mn)/s
}
## cap at +5 sds
for(i in 1:N){
x5<-which(CmatS[,i] > 5)
CmatS[x5,i]<-5
}
L<-dim(CmatS)[1]
L20k<-floor(L/20000)
bnds<-matrix(NA,nrow=L20k,ncol=2)
bnds[,1]<-seq(1,(L-20000),20000)
bnds[,2]<-seq(20000,L,20000)
CmatS20k<-matrix(NA,nrow=L20k,ncol=N)
for(i in 1:N){
for(j in 1:L20k){
CmatS20k[j,i]<-mean(CmatS[bnds[j,1]:bnds[j,2],i])
}
}
## Gaussian HMM with mean and sd fixed
Mstep.normc <- function(x, cond, pm, pn){
nms <- sort(names(pm))
n <- length(x)
m <- ncol(cond$u)
if (all(nms==c("mean", "sd"))){
mean <- pm$mean
sd <- pm$sd
return(list(mean=mean, sd=sd))
}
if (all(nms == "mean")) {
mean <- pm$mean
return(list(mean = mean))
}
if (all(nms == "sd")) {
sd <- pm$sd
return(list(sd = sd))
}
stop("Invalid specification of parameters")
}
rnormc <- rnorm
dnormc <- dnorm
pnormc <- pnorm
qnormc <- qnorm
## set initial values and means
delta<-c(0.05,0.95)
PiA<-matrix(c(0.9,0.1,0.1,0.9),nrow=2,byrow=T)
PiB<-matrix(c(0.8,0.2,0.01,0.99),nrow=2,byrow=T)
Pi<-list(PiA,PiB)
## variables for storing results
fitA<-vector("list",N)
fitB<-vector("list",N)
paramA<-vector("list",N)
paramB<-vector("list",N)
## loop over species/groups
for(i in 1:N){
mns<-c(mean(CmatS20k[,i]),quantile(CmatS20k[,i],probs=c(0.1)))
sds<-c(sd(CmatS20k[,i]),0.5*sd(CmatS20k[,i]))
## rep A
init<-dthmm(CmatS20k[,i],PiA,delta,"normc",list(mean=mns,sd=sds),discrete=FALSE)
paramA[[i]]<-BaumWelch(init,bwcontrol(maxiter = 500, tol = 1e-04,
prt = TRUE, posdiff = FALSE,converge = expression(diff < tol)))
fitA[[i]]<-Viterbi(paramA[[i]])
## rep B
init<-dthmm(CmatS20k[,i],PiB,delta,"normc",list(mean=mns,sd=sds),discrete=FALSE)
paramB[[i]]<-BaumWelch(init,bwcontrol(maxiter = 500, tol = 1e-04,
prt = TRUE, posdiff = FALSE,converge = expression(diff < tol)))
fitB[[i]]<-Viterbi(paramB[[i]])
}
## ids
ids<-read.table("names_covdat.txt",sep="_",header=FALSE)
mel<-which(ids[,3]=='melanic')
grn<-which(ids[,3]=='green')
lowc<-c(8,10,12,13,14,15,16,17,18,20,21,25)
lowc_pure<-c(8,10,12,13,14,15,17,18,21,25)
## use curi pooled
save(list=ls(),file="covhmm.rdat")
We have evidence of the deletion everywhere we should. I plan to re-run this though to include bartmani and to use morphs rather than genotypes for cristinae. I will then make a 14 panel figure with the coverage and model states and then one with genotype data for the SI (and not HMM info).