Post date: Mar 24, 2016 10:11:50 PM
We had a new genome assembly done by Dovetail, which included a Chicago library. Their report is here (Dovetail genome report). The genome assembly is here: king:/uufs/chpc.utah.edu/common/home/u6000989/data/tcrDovetail/timema_26Feb2016_g9wYt.fasta
My plan is to re-designate LGs with this genome based on the old mapping families. Here is what I did.
1. index the new genome with bwa (using 0.7.10-r789).
bwa index -a is timema_26Feb2016_g9wYt.fasta
2. Align mapping family data from second run (second round of sequencing) to the reference genome with bwa.
I set-up a new directory to run the Dovetail genome alignments. Symbolic links to all of the parsed reads from the second round of sequencing (from NCGR) are in king:/uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/MAP*fastq. From within that directory I am using bwa aln and samse to align the data to the Dovetail genome. Here was the call:
perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*fastq
And an example command:
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25299.sai timema_26Feb2016_g9wYt.fasta MAP25299.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25299' -f alnMAP25299.sam timema_26Feb2016_g9wYt.fasta alnMAP25299.sai MAP25299.fastq
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25300.sai timema_26Feb2016_g9wYt.fasta MAP25300.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnMAP25300.sam timema_26Feb2016_g9wYt.fasta alnMAP25300.sai MAP25300.fastq
3. Compressed, sorted and indexed the alignment:
perl ../../scripts/wrap_qsub_slurm_sam2bam.pl alnMAP25*sam
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/
samtools view -b -S -o alnMAP25300.bam alnMAP25300.sam
samtools sort alnMAP25300.bam alnMAP25300.sorted
samtools index alnMAP25300.sorted.bam
3. The fastq file for the first sequencing run (SeqWright, I think) is trimmed_T1_all.fastq (in /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/). It needs to be parsed, and it has some incomplete records (i.e., poorly formatted lines). I first fixed the formatting with removePartials.pl. I then split the fastq file for simultaneous parsing. Here is an example run:
perl parse_barcodes768.pl Timema_LCA_Lib1_key.csv xae
4. I then used bwa to align the sequences to the reference and samtools to compress, index and sort the alignments as described above.
5. I moved all of the alignments (directly or via symbolic links) to the mapdata subdirectory. I then merged the alignments with samtools merge. Here is an example:
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/
samtools merge -n -h alnxMAP25300.sorted.bam mergeMAP25300.sorted.bam alnxMAP25300.sorted.bam alnMAP25300.sorted.bam
samtools sort mergeMAP25300.sorted.bam mergeMAP25300.sorted.sorted
samtools index mergeMAP25300.sorted.sorted.bam
6. Next I sorted the alignments by family for variant calling. I ran a separate jobvar*sh script for each family. Note that I am using newer versions of samtools (1.2) and bcftools (1.3) and that these are rather different. One key note here is that I am requiring a minimum mean epth of 4. Here is the example command (I might want to further refine this... there are standard workflows for whole genomes, but not for GBS):
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/fam25270by25271/
samtools mpileup -f ../../../tcrDovetail/timema_26Feb2016_g9wYt.fasta -q 10 -Q 15 -D -g -u -I merg*sorted.sorted.bam > fam25270x25271.bcf
~/bin/bcftools call -o fam25270x25271.vcf -O v -f GQ -V indels -v -c -p 0.01 -P 1e-3 fam25270x25271.bcf
~/bin/vcfutils.pl varFilter -d 200 -w 5 -e 1e-10 fam25270x25271.vcf > filter_fam25270x25271.vcf
I ended up with 60,462 SNPs for family A (25266 x 26267), 29,600 SNPs for family B (25268 x 26269), 46,352 SNPs for family A (25270 x 26271).
7. Estimate recombination rates, first perl scripts are used to parse the data, then my recomb program is run:
## FAM A
perl ../../scripts/getParentGentoypes.pl 114 115 filter_fam25266x25267.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt
#Found 6173 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 114 115 sub_parentGenotypes.txt filter_fam25266x25267.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0
recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0
## FAM B
perl ../../scripts/getParentGentoypes.pl 24 25 filter_fam25268x25269.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt
#Found 2189 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 24 25 sub_parentGenotypes.txt filter_fam25268x25269.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0
recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0
## FAM C
perl ../../scripts/getParentGentoypes.pl 19 20 filter_fam25270x25271.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt
#Found 919 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 19 20 sub_parentGenotypes.txt filter_fam25270x25271.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0
recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0
8. Finally, I made a first pass at assigning LGs (using only Fam A for now). These results are in data/timema/tcrDovetail/LGs. The script is makelg.R, key parts of which are below:
#library(gplots)
## read data = recombinationr rates and quality
N<-10426846
recomb<-matrix(scan("/uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/fam25266by25267/recombOutA-comb.txt",n=7*N,sep=" "),nrow=N,ncol=7,byrow=T)
## identify and select the high quality subset of the data, Mendealian two-tailed prob. of 0.01 (0.005 on each side), and 50 indviduals
hqr<-which(recomb[,6] > 0.005 & recomb[,5] >= 10)
hqdata<-recomb[hqr,]
## orientate recombination fractions by folding
hqdata[hqdata[,7] > 0.5,7]<-1-hqdata[hqdata[,7] > 0.5,7]
## quantile plots for recomb rate, same vs. different scaffolds
ss<-which(hqdata[,1]==hqdata[,3])
ds<-which(hqdata[,1]!=hqdata[,3])
nss<-length(ss)
nds<-length(ds)
pdf("recombQuantilePlotFA.pdf",width=6,height=6)
plot(seq(1,nds,1)/nds,sort(hqdata[ds,7]),type='l',lwd=2,cex.lab=1.4,cex.axis=1.2,xlab="quantile",ylab="recombination fraction")
lines(seq(1,nss,1)/nss,sort(hqdata[ss,7]),lwd=2,col="orange")
legend(0.05,0.5,legend=c("same scaf.","diff. scaf."),col=c("orange","black"),lwd=2)
dev.off()
## plot distance vs. recomb. within scaffolds
pdf("recombDistPlot.pdf",width=6,height=6)
plot(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7],cex.lab=1.4,cex.axis=1.2,xlab="physical distance",ylab="recombinatrion fraction")
dev.off()
cor.test(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7])
## r = 0.11, p = 0.08
## mean recomb. for scaffolds, create pairwise matrix
slist<-unique(c(hqdata[,1],hqdata[,3]))
l<-length(slist)
dmat<-matrix(0,nrow=l,ncol=l)
dmatmin<-matrix(0.5,nrow=l,ncol=l)
nmat<-matrix(0,nrow=l,ncol=l)
N<-dim(hqdata)[1]
for(n in 1:N){
cat(n,"\n")
a<-which(slist==hqdata[n,1])
b<-which(slist==hqdata[n,3])
nmat[a,b]<-nmat[a,b] + 1
dmat[a,b]<-dmat[a,b] + hqdata[n,7]
if (dmatmin[a,b] > hqdata[n,7]){
dmatmin[a,b] <- hqdata[n,7]
}
}
dmatf<-dmat/nmat
for(i in 1:l){
for(j in i:l){
dmatf[j,i]<-dmatf[i,j]
dmatmin[j,i]<-dmatmin[i,j]
}
}
dmatf2<-dmatf ## replace nan with 0.5, this is conservative
dmatf2[is.nan(dmatf)]<-0.5
## try with pca, kmeans cluster, lda
#outpca<-prcomp(dmatf2,center=T)
outpca<-prcomp(dmatmin,center=T)
npca<-15
#ss<-numeric(50)
#ss[1]<-0
#for(i in 2:50){
# out<-kmeans(x=outpca$x[,1:npca],i,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
# ss[i]<-out$betweenss/out$totss
#}
K<-16
outk<-kmeans(x=outpca$x[,1:npca],K,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
outlda<-lda(x=as.matrix(outpca$x[,1:npca]),grouping=outk$cluster,CV=TRUE)
lg<-as.numeric(as.character(outlda$class))
maxlg<-apply(outlda$posterior,1,max)
lg[maxlg<0.95]<-NA
mns<-numeric(K)
nsc<-numeric(K)
for(i in 1:K){
#mns[i]<-mean(dmatf2[long[lg==i],long[lg==i]],na.rm=TRUE)
mns[i]<-mean(dmatmin[lg==i,lg==i],na.rm=TRUE)
nsc[i]<-sum(lg==i,na.rm=TRUE)
}
nlg<-sum(mns < 0.3) ## 13 clusters have a mean recombination rate less than 0.3, these are my lgs for now
goodLgs<-which(mns < 0.3)
## get names (numbers) for long scaffolds
lslist<-slist
timMap<-vector(mode='list',13)
j<-1
for(i in goodLgs){
lglslist<-lslist[which(lg==i)] ## these are the scaffold numbers on this linkage group
out<-UGlinkage(dmatmin[which(lg==i),which(lg==i)])
# out<-UGlinkage(dmatf2[long[which(lg==i)],long[which(lg==i)]])
timMap[[j]]<-list(lglslist[out$order],out$neighbor.dist)
j<-j+1
}
## write results
for(i in 1:nlg){
a<-paste("lg",i,".txt",sep="")
cat(length(timMap[[i]][[1]])," ",length(timMap[[i]][[2]])," ",i,"\n")
write.table(x=cbind(timMap[[i]][[1]],round(timMap[[i]][[2]],4)),file=a,row.names=FALSE,col.names=FALSE,quote=FALSE)
}
save(list=ls(),file="tcrLgs.rdat")
Some things are encouraging, within scaffold estimates of recombination are almost always lower than between scaffold and their is a hint of a positive correlation between distance and recombination within scaffolds, but there still seems to be quite a bit of noise in the data (in part because we lack recombination estimates for many pairs of scaffolds). Before moving forward, I want to try a comparative alignment between the old and new genomes to see how things look and whether they match up.