Trees for 1000 bp SNP windows

Post date: Nov 01, 2016 10:58:54 PM

First, subset the genome by LG (results for each LG in a subdirectory)

perl mkLgSubsets.pl

perl moveFiles.pl LycWGSnpAlignment.subAutos.subLg*

The data for each LG are in a LG specific subdirectory, but are linked to the allLgs sub-directory.

ML trees were inferred for 1000 bp non-overlapping windows along each LG using the ML algorithm in RAxML (as with the whole genome tree). I used the wrapWindows.pl script for this, which runs my mkRunWindows.pl script:

#!/usr/bin/perl

for $i (1..22){

print "working on LG $i\n";

system "perl ../mkRunWindows.pl LycWGSnpAlignment.subAutos.subLg$i > log\n";

}

#!/usr/bin/perl

# generates alignments of $win SNPs and generates a tree for each alignment

# run from within a sub-directory, give the alignment file

$win = 1000; ## length of window

$off = $win -1;

## get information

$file = shift(@ARGV);

$file =~ m/(\d+)$/;

$lg = $1;

open(IN, $file) or die "failed to open $file\n";

$hdr = <IN>;

$hdr =~ m/\d+\s+(\d+)/ or die "failed to match $hdr\n";

$Nsnp = $1;

close(IN);

## run through windows

$winub = $win;

while($winlb <= $Nsnp){

open (SUB, "> winSub") or die "failed to write sub file\n";

$winlb = $winub - $off;

if ($winlb == 1) { ## first winow

$lb = $winub+1;

$ub = $Nsnp;

print SUB "$lb-$ub\n";

}

else {

$lb1 = 1;

$ub1 = $winlb-1;

$lb2 = $winub+1;

$ub2 = $Nsnp;

print SUB "$lb1-$ub1\n$lb2-$ub2\n";

}

print "~/bin/raxmlHPC-SSE3 -E winSub -s $file -m GTRCAT -n LG$lg\n";

system "~/bin/raxmlHPC-SSE3 -E winSub -s $file -m GTRCAT -n LG$lg\n";

print "~/bin/raxmlHPC-SSE3 -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA -s LycWGSnpAlignment.subAutos.subLg$lg.winSub -n lg$lg"."win$winlb\n";

system "~/bin/raxmlHPC-SSE3 -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA -s LycWGSnpAlignment.subAutos.subLg$lg.winSub -n lg$lg"."win$winlb\n";

$winub+=$win;

}

I then used the R (3.2.3) packages ape (v 3.4) and phangorn (v 2.0.2) to summarize and plot the trees. Here are the key bits (R code below).

1. The five most common topologies include 79.4% of the 935 trees (across all LGs). These are shown in phylo.pdf along with the genome tree (which matches the most common one). Note, there are 14 distinct topologies.
2. Proportion of trees with Sierra Nevada (S) or Warner Mts. (W) sister to L. melissa (M) or L. anna (A): SM = 15%, SA = 56.3%, WM = 31.8%, WA = 14.7%.
3. By comparison, the sister group SW occurs in 6% of the trees. Conclusion = S and W are (mostly or entirely) independent lineages.
4. Proportion of times S or W is closer to A vs. M (not necessarily sister): SM = 24.5%, SA = 75%, WM = 73.3%, WA = 24% (do not add to 100% as S or W can be equally close to A and M, but this is rare).
5. Of the 935 trees (and 14 topologies), 207 adjacent windows have the same topology. A simple randomization test (N = 1000 randomizations) shows that this is more autocorrelation than expected by chance: mean null = 155.7, 95th quantile =173, max = 189, enrichment = 1.33, p < 0.001.

library("phangorn")

mltr<-read.tree("RAxML_bestTree.raxmlAutos")

mltr$tip.label<-c("Sierra_Nevada","L_idas","L_melissa","Warner_Mts","L_anna")

trs<-read.tree("alltrees")

o<-myUniqueTrees(trs)

cnts<-table(o[[2]])

props<-cnts/sum(cnts)

best<-c(2,3,9,8,1)

## figure

cols<-c("orange","brown","forestgreen","darkblue","darkgray")

pdf("phylo.pdf",width=8,height=10)

par(mfrow=c(3,2))

par(mar=c(1.5,1.5,3,1.5))

plot(midpoint(mltr),tip.color=cols[rank(mltr$tip.label)],cex=1.3)

add.scale.bar()

title(main="(a) full data set",adj=0)

hds<-c("(b)","(c)","(d)","(e)","(f)")

j<-0

for(i in best){

j<-j+1

plot.phylo(midpoint(o[[1]][[i]]),use.edge.length=FALSE,cex=1.3,tip.color=cols[rank(o[[1]][[i]]$tip.label)])

title(main=paste(hds[j]," ",round(100*props[i],1),"% of windows",sep=""),adj=0)

}

dev.off()

## null test

obs<-sum(o[[2]][1:(n-1)]==o[[2]][2:n])

x<-o[[2]]

null<-rep(NA,1000)

for(i in 1:1000){

x<-sample(x,n,replace=FALSE)

null[i]<-sum(x[1:(n-1)]==x[2:n])

}

pdf("allphylo.pdf",width=8,height=10)

par(mfrow=c(3,2))

for(i in 1:14){

plot.phylo(midpoint(o[[1]][[i]]),use.edge.length=FALSE,cex=1.3,tip.color=cols[rank(o[[1]][[i]]$tip.label)])

title(main=paste(round(100*props[i],1),"% of windows",sep=""),adj=0)

}

dev.off()

#####

myUniqueTrees<-function (x, incomparables = FALSE, use.edge.length = FALSE,

use.tip.label = TRUE)

{

n <- length(x)

keep <- 1L

old.index <- seq_len(n)

for (i in 2:n) {

already.seen <- FALSE

for (j in keep) {

if (all.equal(x[[j]], x[[i]], use.edge.length = use.edge.length,

use.tip.label = use.tip.label)) {

already.seen <- TRUE

old.index[i] <- j

cnts[j]<-cnts[j]+1

break

}

if (!already.seen) {

keep <- c(keep, i)

cnts<-c(cnts,1) ## here

}

res <- x[keep]

attr(res, "old.index") <- old.index

out <- list(res,old.index)

out

}

Page updated

Google Sites

Report abuse