defining the indel

Post date: Feb 22, 2018 12:4:22 AM

We wanted to formally define the insertion-deletion polymorphism that is evident from the coverage plots. I did this (am doing really, need a few more things) by fitting HMMs to the coverage data. Here is what I did so far.

1. Copied files with depth data from Romain to /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_genarch/indel/. Here are his notes:

Here it is: /stash/shared/romain/depth4Zach

I only did compute this for scaffold 128 only. There is a line for each nucleotide of the scaffold, I could do it genome wide but this may take lots of space and time to generate, and am not sure it is necessary in our case ? What do you think ?

You'll find a folder for each species. Within each folder the results are sorted by morph or genotypes (subfolders). Within this subfolder you'll find graphs and a txt files with coverage infos.

I computed results on an individual level (i did not remember I did this) and the two last columns are number of reads accross individuals and the number of individuals that do have reads.

I did not do this with results from a pca from bartmani yet. For cristinae I did the cluster selection 'by eye'. Would it be worth having a stronger method for genotypes selection there ? Once we decide I can run this for bartmani.

2. Grab columns with coverage data from each file.

perl getCoverageData.pl T*/*/*txt

3. Fig the HMM. This is done in R with runHMM.R

## fit HMMs for coverage data

load("coverage.rdat")

library(HiddenMarkov)

N<-dim(Cmat)[2]

## standardize coverage matrix

CmatS<-Cmat

for(i in 1:N){

xna<-which(is.na(Cmat[,i])==TRUE)

CmatS[xna,i]<-0

mn<-mean(CmatS[,i])

s<-sd(CmatS[,i])

CmatS[,i]<-(CmatS[,i]-mn)/s

}

## cap at +5 sds

for(i in 1:N){

x5<-which(CmatS[,i] > 5)

CmatS[x5,i]<-5

}

L<-dim(CmatS)[1]

L20k<-floor(L/20000)

bnds<-matrix(NA,nrow=L20k,ncol=2)

bnds[,1]<-seq(1,(L-20000),20000)

bnds[,2]<-seq(20000,L,20000)

CmatS20k<-matrix(NA,nrow=L20k,ncol=N)

for(i in 1:N){

for(j in 1:L20k){

CmatS20k[j,i]<-mean(CmatS[bnds[j,1]:bnds[j,2],i])

}

## Gaussian HMM with mean and sd fixed

Mstep.normc <- function(x, cond, pm, pn){

nms <- sort(names(pm))

n <- length(x)

m <- ncol(cond$u)

if (all(nms==c("mean", "sd"))){

mean <- pm$mean

sd <- pm$sd

return(list(mean=mean, sd=sd))

}

if (all(nms == "mean")) {

mean <- pm$mean

return(list(mean = mean))

}

if (all(nms == "sd")) {

sd <- pm$sd

return(list(sd = sd))

}

stop("Invalid specification of parameters")

}

rnormc <- rnorm

dnormc <- dnorm

pnormc <- pnorm

qnormc <- qnorm

## set initial values and means

delta<-c(0.05,0.95)

PiA<-matrix(c(0.9,0.1,0.1,0.9),nrow=2,byrow=T)

PiB<-matrix(c(0.8,0.2,0.01,0.99),nrow=2,byrow=T)

Pi<-list(PiA,PiB)

## variables for storing results

fitA<-vector("list",N)

fitB<-vector("list",N)

paramA<-vector("list",N)

paramB<-vector("list",N)

## loop over species/groups

for(i in 1:N){

mns<-c(mean(CmatS20k[,i]),quantile(CmatS20k[,i],probs=c(0.1)))

sds<-c(sd(CmatS20k[,i]),0.5*sd(CmatS20k[,i]))

## rep A

init<-dthmm(CmatS20k[,i],PiA,delta,"normc",list(mean=mns,sd=sds),discrete=FALSE)

paramA[[i]]<-BaumWelch(init,bwcontrol(maxiter = 500, tol = 1e-04,

prt = TRUE, posdiff = FALSE,converge = expression(diff < tol)))

fitA[[i]]<-Viterbi(paramA[[i]])

## rep B

init<-dthmm(CmatS20k[,i],PiB,delta,"normc",list(mean=mns,sd=sds),discrete=FALSE)

paramB[[i]]<-BaumWelch(init,bwcontrol(maxiter = 500, tol = 1e-04,

prt = TRUE, posdiff = FALSE,converge = expression(diff < tol)))

fitB[[i]]<-Viterbi(paramB[[i]])

}

## ids

ids<-read.table("names_covdat.txt",sep="_",header=FALSE)

mel<-which(ids[,3]=='melanic')

grn<-which(ids[,3]=='green')

lowc<-c(8,10,12,13,14,15,16,17,18,20,21,25)

lowc_pure<-c(8,10,12,13,14,15,17,18,21,25)

## use curi pooled

save(list=ls(),file="covhmm.rdat")

We have evidence of the deletion everywhere we should. I plan to re-run this though to include bartmani and to use morphs rather than genotypes for cristinae. I will then make a 14 panel figure with the coverage and model states and then one with genotype data for the SI (and not HMM info).

Page updated

Google Sites

Report abuse