## Maximum Parsimony & Distance Analysis in R and Phylip## Parsimony analysis in RWe will be using package
`phangorn ` for parsimony analysis in R. This package has several functions for calculating parsimony scores, and implements a couple different tree-search algorithms.First, install the package phangorn (you can use
`installed.packages() ` function to check if it is already installed)Second, load the library:
`> library(phangorn)` Now, import the alignment you used for the your homework (note, we used read.dna to get data in phy format used by ape; the two formats can be interconverted with phyDat and as.DNAbin functions)
`> aln <- read.phyDat("homework.txt")` Sometimes your file may not be in your working directory (type
`getwd()` to figure out what it is)To open a file located outside of your working directory you have to enter the full path to the file or use
`> aln <- read.phyDat(file.choose())` We will also need to get our tree. We can read it from the disk or just copy and paste it directly into the command: `> homework.tre <- read.tree(text="` (((Dolphin,Sperm_whale),Blue_whale),((Hippopotamus,Pig),(((Cow,(Sheep,Goat)),Giraffe),Camel)));")Now calculate the parsimony score with the parsimony command (by default, this function uses the fitch method): `> parsimony (homework.tre, aln)` The score you got is probably different from the parsimony score you calculated in your homework. This is because the parsimony function uses Fitch algorithm by default. To calculate the parsimony score using the Sankoff algorithm we need to define a scoring matrix: `> scoring <- matrix(` ` c(0,3,1,3,` ` 3,0,3,1,` ` 1,3,0,3,` ` 3,1,3,0),` ` nrow=4,` `ncol=4)` and indicate the sankoff method in the parsimony command: `> parsimony (homework.tre, aln, method="sankoff", cost=scoring)` Did you get the same score for your homework? So far, we found the parsimony score for a given tree. You can use the `> nni.tre <- optim.parsimony(homework.tre, aln, method="sankoff", cost=scoring)` What's the score of this tree? Is the nni.tre different from homework.tre? Try this: `> cophyloplot(homework.tre,nni.tre)` We can get a random starting tree for a parsimony search using the `> ra.tre = random.addition(aln)` What's its parsimony score and how does it look? Check if you can find a better tree by rearranging the ra.tre with the optim.parsimony function. Phangorn does not have an option to do an exhaustive search. But another library (phytools) does. So we want to install it. However, you may be wondering if it is possible to install all the relevant packages at once. Yes, it is, and here are the commands to do this. However, let's not do it in class because it will take some time to install all the libraries! `> install.packages('ctv')` `> library('ctv')` `> install.views('Phylogenetics')` Use the old-fashioned way to install the library("phytools") OK, to the exhaustive search now: `> ex.tre <- exhaustiveMP(aln, tree=NULL, method="exhaustive")` How many trees are we expecting for 10 taxa? Check with the It is probably taking to long, so we need to killl this command (Esc) and use the branch and bound search instead: `> bandb.tre <- exhaustiveMP(aln, tree=ra.tre, method="branch.and.bound")` Notice that we are using our ra.tre as the "bound". Try to use homework.tre instead. How many most parsimonious trees did you find? Let's build a strict consensus tree out of them: `> strict.tre <- consensus(bandb.tre, p=1.0)` Change p to 0.5 to calculate the "majority rule" consensus tree. There are a couple of indices we can calculate for a tree in parsimony analysis: `> CI(homework.tre, aln, cost = NULL) #calculates the consistency index` `> RI(homework.tre, aln, cost = NULL) #calculates the retention index` The [per-character] consistency index ci = m/s, where m = minimum number of steps on any cladogram (m = # character states -1) and s = actual number of steps on a particular cladogram. The ensemble consistency index CI is a similar index summed over all characters. The [per-character] retention index ri = (g-s)/(g-m), where m and s are as above and g is the maximal number of steps for the character on any cladogram. The ensemble retention index RI is a similar index summed over all characters: RI = (G-S)/(G-M). ## Phylogenetic inference using DistancesWe will demonstrate the use of distance matrices and distance methods in R using
`phangorn` . This package has lots of great functions for calculating distances, and implements a large number of substitution models. It also implements a variety of distance-based tree-building algorithms, such as UPGMA, Neighbor-Joining and FastME. To get started, we will run UPGMA and NJ analysis on the same dataset we used in class (carnivore.distances.csv). Read this file with the following command:
`> carnDist <- read.csv("carnivore.distances.csv", header = TRUE, row.names = 1)` Now build UPGMA and NJ trees based on these sequences: `> carnUpgma.tree <- upgma(carnDist)` `> carnNJ.tree <- NJ(carnDist)` You might have noticed that the NJ command would not run on carnDist dataset. This is because the function read.csv we used creates a 'data.frame' object while the NJ function expects a matrix. We'll use as.matrix function to convert the data.frame into a matrix: `> carnDist <- as.matrix(read.csv("carnivore.distances.csv", header = TRUE, row.names = 1))` Try repeating the commands for UPGMA and NJ trees. Use ` cophyloplot` command to compare these trees.## UPGMA vs. NJAs discussed in class, a nice property of NJ algorithm is that it is guaranteed to reconstruct the correct tree when the true (patristic) distances are known. The following exercise (borrowed from the creator of phytools) will demonstrate this property (not shared by UPGMA). We first generate a random tree: `> tree <- rtree(n = 10, rooted = FALSE)` Second, we measure the distances on that tree `> D <- cophenetic(tree)` `> round(D, 3)` We can now calculate NJ and UPGMA trees: `> nj.tree <- NJ(D)` `> upgma.tree <- upgma(D)` `> upgma.tree <- unroot(upgma.tree)` Use the following code to display the trees: `> par(mfrow = c(1, 3))` `plot(tree, type = "unrooted", edge.width = 2, mar = c(0.1, 0.1, 4.1, 0.1), cex = 1)` `title(main = "true tree")` `plot(nj.tree, type = "unrooted", edge.width = 2, mar = c(0.1, 0.1, 4.1, 0.1), ` `cex = 1)` `title(main = "NJ tree")` `plot(upgma.tree, type = "unrooted", edge.width = 2, mar = c(0.1, 0.1, 4.1, 0.1), ` `cex = 1)` `title(main = "UPGMA tree")` Now we can check if the trees are equal: `> all.equal.phylo(tree, nj.tree)` `> all.equal.phylo(tree, upgma.tree)` and calculate topological distances between them: `> RF.dist(tree, nj.tree)` `> RF.dist(tree, upgma.tree)` So far we used some distances given to us or calculated from trees. In this part of the exercise we will calculate distances ourselves. But first, we will explore the relationship between the observed proportion of substitutions (p-distance, p) as a function of the actual number of substitutions (actual evolutionary distance, d). The first distance correction we studied was based on the Jukes Cantor model (JC69):This formula gave us the maximum likelihood estimate of the actual distance as a function of the p-distance statistic. We can study the behavior of the Jukes Cantor p-distance in R using the `curve` function.`> curve(3/4*(1-exp(-(4/3)*d)), xlab="Actual Distance", ylab="p-distance estimate", xname="d",xlim=c(0,5))` You can see that as actual evolutionary distance increases, the p-distance estimate also increases, but levels off at a certain value and never increases beyond this point. This is called the saturation point, the distance expected between completely random sequences.
Recall the Kimura 2 Parameter model (K80) which allows rates of transitions and transversions to differ. We can describe this model using the transition transversion ratio κ Let's look at how the K80 distance compares with the JC69 distance. The formula for the K80 p assuming a particular value of κ is: For now let's assume κ = 10 and plot the K80 distance: `> k=10; ` curve(3/4-(1/2)*exp(-(2*k+1)*p/(k+1))-(1/4)*exp(-2*p/(k+1)), add=TRUE, xname="p", lty=2)Which distance estimate saturates more quickly, JC69 or K80? Why?## Phylogenetic inference using DistancesOK, we are ready to estimate distances from molecular data. To get started, first import a phylip-formatted alignment into a variable called `aln`
`> aln <- read.dna("cob_nt.phy")` We will compute distance matrices for this alignment using the function `dist.dna` . Take a look at the help page for this function`> help(dist.dna)` You can see that some models we have discussed, such as GTR, are not available and that is because an analytical formula is not available for the maximum likelihood estimate (MLE) of the distance under these models. Instead, numerical approximations of the MLE's have to be found for these distances. We will discuss this issue later when we talk about Maximum Likelihood methods.
Compute a distance matrix using some substitution model, say Jukes-Cantor, and store it in a variable called
`dm` :`> dm <- dist.dna(aln, model="JC69") ` You can take a look at this distance matrix by simply typing
`dm` . Then, calculate an UPGMA tree from this distance matrix:`> upgma.tre <- upgma(dm) ` Take a look at the tree:
`> plot(upgma.tre)` Now try the Neighbor Joining algorithm:
`> nj.tre <- nj(dm)` `> plot(nj.tre)` Now let's try a different model, say the Kimura 2 Parameter model (K80). We'll also model across-site rate variation using a gamma distribution. You can visualize a gamma distribution with the following code `> shape <- 1.0` `> x <- seq(0, 10, length=100)` `> y <- dgamma(x,shape)` `> plot(x,y,type="l")` (This creates a sequence of x values from 0 to 10 and stores them in
`x` . Then for the y values, `dgamma` calculates thedensity at each value of
`x` from a gamma distribution with shape parameter `shape` )What assumption are you making when you use a shape parameter of 1.0?Give
`dist.dna` a value for the shape parameter of the gamma distribution and plot the resulting tree:`> dm <- dist.dna(aln, model="K80", gamma=1.0) ` `> tree <- nj(dm) ` `> plot(tree)` Does the tree look different when you use a different model? Try building several trees using several different models and shape parameters.
You can save a tree to a file with the write.tree command. For example:
`> write.tree(tree,file="cob_nt_K80.tre")` This will create a Newick format file called
`cob_nt_K80.tre ` containing the tree in the tree variable. You can open this file using FigTree to visualize the tree in more detail and/or export a pdf of the tree. Alternatively, you can export a pdf file of the tree from within R using the following commands:
`> pdf(file="tree.pdf")` `> plot(tree)` `> dev.off()` This will create a file called `tree.pdf` containing the output of plot(tree)## Distance analysis in Phylip## Finally, we will repeat the distance analysis with Phylip. PHYLIP is a package of phylogenetic programs written by Joe Felsenstein group and first released in 1980. The programs can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood. One can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices. Various datasets can be used, including nucleotide sequences, protein sequences, gene frequencies, restriction sites, restriction fragments, distances, discrete characters, and continuous characters. One of the main advantages of this package is its thorough, well organized, and up to date documentation. You can develop you PHYLIP skills with this excellent tutorial.Here is a general outline of the procedure we will follow. We already have an alignment, so we can skip the first step. For the rest of the steps we will be using several small programs inside the package: dnadist, neighbor, and, optionally, seqboot and consense. The programs should be installed on your computer, but you need to run the following commands from your home directory (just type `> echo 'export PATH="$PATH:~/Dropbox/DO_NOT_MODIFY/.bin/phylip_exe"' >> ~/.bashrc` `> source ~/.bashrc` |

Computer labs > Old labs >