genome from Dovetail and new LGs

Post date: Mar 24, 2016 10:11:50 PM

We had a new genome assembly done by Dovetail, which included a Chicago library. Their report is here (Dovetail genome report). The genome assembly is here: king:/uufs/chpc.utah.edu/common/home/u6000989/data/tcrDovetail/timema_26Feb2016_g9wYt.fasta

My plan is to re-designate LGs with this genome based on the old mapping families. Here is what I did.

1. index the new genome with bwa (using 0.7.10-r789).

bwa index -a is timema_26Feb2016_g9wYt.fasta

2. Align mapping family data from second run (second round of sequencing) to the reference genome with bwa.

I set-up a new directory to run the Dovetail genome alignments. Symbolic links to all of the parsed reads from the second round of sequencing (from NCGR) are in king:/uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/MAP*fastq. From within that directory I am using bwa aln and samse to align the data to the Dovetail genome. Here was the call:

perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*fastq

And an example command:

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/

bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25299.sai timema_26Feb2016_g9wYt.fasta MAP25299.fastq

bwa samse -n 1 -r '@RG ID:timema-MAP25299' -f alnMAP25299.sam timema_26Feb2016_g9wYt.fasta alnMAP25299.sai MAP25299.fastq

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/

bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25300.sai timema_26Feb2016_g9wYt.fasta MAP25300.fastq

bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnMAP25300.sam timema_26Feb2016_g9wYt.fasta alnMAP25300.sai MAP25300.fastq

3. Compressed, sorted and indexed the alignment:

perl ../../scripts/wrap_qsub_slurm_sam2bam.pl alnMAP25*sam

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/

samtools view -b -S -o alnMAP25300.bam alnMAP25300.sam

samtools sort alnMAP25300.bam alnMAP25300.sorted

samtools index alnMAP25300.sorted.bam

3. The fastq file for the first sequencing run (SeqWright, I think) is trimmed_T1_all.fastq (in /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/). It needs to be parsed, and it has some incomplete records (i.e., poorly formatted lines). I first fixed the formatting with removePartials.pl. I then split the fastq file for simultaneous parsing. Here is an example run:

perl parse_barcodes768.pl Timema_LCA_Lib1_key.csv xae

4. I then used bwa to align the sequences to the reference and samtools to compress, index and sort the alignments as described above.

5. I moved all of the alignments (directly or via symbolic links) to the mapdata subdirectory. I then merged the alignments with samtools merge. Here is an example:

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/

samtools merge -n -h alnxMAP25300.sorted.bam mergeMAP25300.sorted.bam alnxMAP25300.sorted.bam alnMAP25300.sorted.bam

samtools sort mergeMAP25300.sorted.bam mergeMAP25300.sorted.sorted

samtools index mergeMAP25300.sorted.sorted.bam

6. Next I sorted the alignments by family for variant calling. I ran a separate jobvar*sh script for each family. Note that I am using newer versions of samtools (1.2) and bcftools (1.3) and that these are rather different. One key note here is that I am requiring a minimum mean epth of 4. Here is the example command (I might want to further refine this... there are standard workflows for whole genomes, but not for GBS):

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/fam25270by25271/

samtools mpileup -f ../../../tcrDovetail/timema_26Feb2016_g9wYt.fasta -q 10 -Q 15 -D -g -u -I merg*sorted.sorted.bam > fam25270x25271.bcf

~/bin/bcftools call -o fam25270x25271.vcf -O v -f GQ -V indels -v -c -p 0.01 -P 1e-3 fam25270x25271.bcf

~/bin/vcfutils.pl varFilter -d 200 -w 5 -e 1e-10 fam25270x25271.vcf > filter_fam25270x25271.vcf

I ended up with 60,462 SNPs for family A (25266 x 26267), 29,600 SNPs for family B (25268 x 26269), 46,352 SNPs for family A (25270 x 26271).

7. Estimate recombination rates, first perl scripts are used to parse the data, then my recomb program is run:

## FAM A

perl ../../scripts/getParentGentoypes.pl 114 115 filter_fam25266x25267.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt

#Found 6173 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 114 115 sub_parentGenotypes.txt filter_fam25266x25267.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0

recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0

## FAM B

perl ../../scripts/getParentGentoypes.pl 24 25 filter_fam25268x25269.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt

#Found 2189 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 24 25 sub_parentGenotypes.txt filter_fam25268x25269.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0

recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0

## FAM C

perl ../../scripts/getParentGentoypes.pl 19 20 filter_fam25270x25271.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.95 parentGenotypes.txt

#Found 919 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 19 20 sub_parentGenotypes.txt filter_fam25270x25271.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

recomb -i p0_offspringGenotypes.txt -o recombOutP0A.txt -n 10 -m 0.95 -p 0

recomb -i p1_offspringGenotypes.txt -o recombOutP1A.txt -n 10 -m 0.95 -p 0

8. Finally, I made a first pass at assigning LGs (using only Fam A for now). These results are in data/timema/tcrDovetail/LGs. The script is makelg.R, key parts of which are below:

#library(gplots)

## read data = recombinationr rates and quality

N<-10426846

recomb<-matrix(scan("/uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/fam25266by25267/recombOutA-comb.txt",n=7*N,sep=" "),nrow=N,ncol=7,byrow=T)

## identify and select the high quality subset of the data, Mendealian two-tailed prob. of 0.01 (0.005 on each side), and 50 indviduals

hqr<-which(recomb[,6] > 0.005 & recomb[,5] >= 10)

hqdata<-recomb[hqr,]

## orientate recombination fractions by folding

hqdata[hqdata[,7] > 0.5,7]<-1-hqdata[hqdata[,7] > 0.5,7]

## quantile plots for recomb rate, same vs. different scaffolds

ss<-which(hqdata[,1]==hqdata[,3])

ds<-which(hqdata[,1]!=hqdata[,3])

nss<-length(ss)

nds<-length(ds)

pdf("recombQuantilePlotFA.pdf",width=6,height=6)

plot(seq(1,nds,1)/nds,sort(hqdata[ds,7]),type='l',lwd=2,cex.lab=1.4,cex.axis=1.2,xlab="quantile",ylab="recombination fraction")

lines(seq(1,nss,1)/nss,sort(hqdata[ss,7]),lwd=2,col="orange")

legend(0.05,0.5,legend=c("same scaf.","diff. scaf."),col=c("orange","black"),lwd=2)

dev.off()

## plot distance vs. recomb. within scaffolds

pdf("recombDistPlot.pdf",width=6,height=6)

plot(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7],cex.lab=1.4,cex.axis=1.2,xlab="physical distance",ylab="recombinatrion fraction")

dev.off()

cor.test(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7])

## r = 0.11, p = 0.08

## mean recomb. for scaffolds, create pairwise matrix

slist<-unique(c(hqdata[,1],hqdata[,3]))

l<-length(slist)

dmat<-matrix(0,nrow=l,ncol=l)

dmatmin<-matrix(0.5,nrow=l,ncol=l)

nmat<-matrix(0,nrow=l,ncol=l)

N<-dim(hqdata)[1]

for(n in 1:N){

cat(n,"\n")

a<-which(slist==hqdata[n,1])

b<-which(slist==hqdata[n,3])

nmat[a,b]<-nmat[a,b] + 1

dmat[a,b]<-dmat[a,b] + hqdata[n,7]

if (dmatmin[a,b] > hqdata[n,7]){

dmatmin[a,b] <- hqdata[n,7]

}

dmatf<-dmat/nmat

for(i in 1:l){

for(j in i:l){

dmatf[j,i]<-dmatf[i,j]

dmatmin[j,i]<-dmatmin[i,j]

}

dmatf2<-dmatf ## replace nan with 0.5, this is conservative

dmatf2[is.nan(dmatf)]<-0.5

## try with pca, kmeans cluster, lda

#outpca<-prcomp(dmatf2,center=T)

outpca<-prcomp(dmatmin,center=T)

npca<-15

#ss<-numeric(50)

#ss[1]<-0

#for(i in 2:50){

# out<-kmeans(x=outpca$x[,1:npca],i,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")

# ss[i]<-out$betweenss/out$totss

K<-16

outk<-kmeans(x=outpca$x[,1:npca],K,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")

outlda<-lda(x=as.matrix(outpca$x[,1:npca]),grouping=outk$cluster,CV=TRUE)

lg<-as.numeric(as.character(outlda$class))

maxlg<-apply(outlda$posterior,1,max)

lg[maxlg<0.95]<-NA

mns<-numeric(K)

nsc<-numeric(K)

for(i in 1:K){

#mns[i]<-mean(dmatf2[long[lg==i],long[lg==i]],na.rm=TRUE)

mns[i]<-mean(dmatmin[lg==i,lg==i],na.rm=TRUE)

nsc[i]<-sum(lg==i,na.rm=TRUE)

}

nlg<-sum(mns < 0.3) ## 13 clusters have a mean recombination rate less than 0.3, these are my lgs for now

goodLgs<-which(mns < 0.3)

## get names (numbers) for long scaffolds

lslist<-slist

timMap<-vector(mode='list',13)

j<-1

for(i in goodLgs){

lglslist<-lslist[which(lg==i)] ## these are the scaffold numbers on this linkage group

out<-UGlinkage(dmatmin[which(lg==i),which(lg==i)])

# out<-UGlinkage(dmatf2[long[which(lg==i)],long[which(lg==i)]])

timMap[[j]]<-list(lglslist[out$order],out$neighbor.dist)

j<-j+1

}

## write results

for(i in 1:nlg){

a<-paste("lg",i,".txt",sep="")

cat(length(timMap[[i]][[1]])," ",length(timMap[[i]][[2]])," ",i,"\n")

write.table(x=cbind(timMap[[i]][[1]],round(timMap[[i]][[2]],4)),file=a,row.names=FALSE,col.names=FALSE,quote=FALSE)

}

save(list=ls(),file="tcrLgs.rdat")

Some things are encouraging, within scaffold estimates of recombination are almost always lower than between scaffold and their is a hint of a positive correlation between distance and recombination within scaffolds, but there still seems to be quite a bit of noise in the data (in part because we lack recombination estimates for many pairs of scaffolds). Before moving forward, I want to try a comparative alignment between the old and new genomes to see how things look and whether they match up.

Page updated

Google Sites

Report abuse