Post date: Sep 05, 2013 10:0:59 PM
Excluding low quality (uncertain) genotype calls from recombination rate estimates made a large difference. See the summary of recombination rate estimates for variants on the same and different scaffolds (this is the result from only using genotypes with 90% certainty),
same scaffold:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01357 0.06922 0.08606 0.11010 0.11680 0.49980
different scaffolds:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00063 0.32780 0.40450 0.37840 0.45580 0.50000
My basic idea at the moment is to compute the average recombination rate between all pairs of scaffolds and to use this distance matrix to cluster scaffolds into linkage groups. Here is an example of the R code for this (this uses only a subset of 10% the data, as explained below),
## scan data, only 10%
rp0pr90<-matrix(scan("~/labs/evolution/scratch/recomb0.9_p0_offspringGenotypes.txt",n=106*4217752,sep=" "),nrow=4217752,ncol=106,byrow=T)
## subset with 10 inds, and p > 0.01 for segregation distortion
hqr<-which(rp0pr90[,6] > 0.01 & rp0pr90[,5] >= 10)
hqrest<-apply(rp0pr90[hqr,-c(1:6)],1,mean)
hqrest[hqrest>0.5]<-1-hqrest[hqrest>0.5]
hqinfo<-rp0pr90[hqr,1:6]
slist<-unique(c(unique(hqinfo[,1])),unique(hqinfo[,3]))
l<-length(slist)
dmat<-matrix(NA,nrow=l,ncol=l)
for(i in 1:l){
for(j in i:l){
dmat[i,j]<-mean(hqrest[hqinfo[,1]==slist[i] & hqinfo[,3]==slist[j]])
dmat[j,i]<-dmat[i,j]
}
}
d<-as.dist(dmat) ## may need to replace NAs
fit<-hclust(d,method="ward")
## threshold matrix
thdmat<-dmat
thdmat[thdmat>0.2]<-0.5
thdmat[thdmat<=0.2]<-0
Unfortunately, as I noted above, I am only working with a subset of the data. This is because R will not read in all of the data (it is a bunch, but I wasn't running out of RAM). I can probably circumvent this problem by only printing the point estimate of the recombination rate for each pair of SNV's. I am going to modify the code to allow for this option and re-run the recombination analysis with all three families.