OneWay_Anova_with_TukeyHSD_Rcode_tutorial

Tukey HSD R Code and Tutorial

Tutorial and Code for conducting Tukey HSD, Scheffe, Bonferroni and Holm multiple comparison in the R statistical package

R installation instructions:

Google the search term "r-project".

locate The R Project for Statistical Computing at www.r-project.org.

Download R for Windows and install on your PC. Select 32-bit or 64-bit checkbox to suit your PC.

There is absolutely no need to install both 32-bit and 64-bit versions of R.

You will have desktop and start menu icons for R after succesful installation of R.

R code and execution instructions:

Right click on the R icon and select "Run as Administrator". This "Run as Adminstrator" choice in Windows is absolutely needed to correctly install the R library package for multiple comparison. R opens a window with a red > prompt that awaits user input from keyboard. Type or copy-paste the R command at the R prompt:

Command

install.packages("multcomp")

Output

You will be asked to select a CRAN mirror site to download the R package for multiple comparison. The output will look something like this:

--- Please select a CRAN mirror for use in this session --- trying URL 'http://cran.cnr.Berkeley.edu/bin/windows/contrib/3.1/multcomp_1.3-7.zip' Content type 'application/zip' length 621682 bytes (607 Kb) opened URL downloaded 607 Kb package 'multcomp' successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\navendu\AppData\Local\Temp\RtmpIpxrei\downloaded_packages

Next, in Excel, stack your input data into two columns: the first column is the numerical observation; the second column contains a character-string label for the distinct category to which the observation belongs. The first row contains the column names, which we have chosen to be Y and X in our demo. Next, copy-paste this range that spans 2 Excel columns, from its first row indicating the column names Y and X, up to the last row, into Notepad. The stacked input data for our demo example looks like this in Notepad, when from copied from Excel and pasted into Notepad:

Y X 0.28551035 A 0.338524035 A 0.088313218 A 0.205930807 A 0.363240102 A 0.52173913 B 0.763358779 B 0.32546786 B 0.425305688 B 0.378071834 B 0.989119683 C 1.192718142 C 0.788288288 C 0.549176236 C 0.544588155 C 1.26705653 D 1.625320787 D 1.266108976 D 1.154187629 D 1.268498943 D 1.069518717 D

Save newly created Notepad text file in your preferred directory; create a new directory if required. It may not be immediately obvious that the two columns of excel are represented by a tab character separator (delimiter) between the two values in all the rows of the Notepad-generated text file. R recognizes this separator as representing two columns as a tab-delimited text file, when reading, digesting and parsing this file and storing it in its memory. We have saved our stacked demo data in 2 columns as a Notepad file named demo.txt in our preferred directory. The R window has a top menu bar. Select the menu File-Change Dir Point to the directory or folder where you saved your input data file (this is named demo.txt in our example). R subsequently takes all input files from that directory, and saves all R output files, if required, to that same directory. Now gently persuade R to read your input file into its memory. Our demo example input file is named demo.txt, it resides in our designated directory of choice, which is the same directory that we point at in the R main menu File-Change Dir. Warning: if you save your file somewhere, and R does not know what that exact directory is, you will immediately get error messages. Except for Google's computer, all computers are totally unintelligent when it comes to figuring out themselves (without being told) where exactly their starting input data feed resides.

Command

expt1 <- read.table("demo.txt",header=T)

Output

There is no output if this step is successful. Your input data in the file demo.txt now resides in a R data structure named "expt1". If there is an error message, it usually relates to R being unable to find the file or directory. Fix this by pointing to the correct directory in the main R menu File-Change Dir.

R users are in general agreement that the most difficult steps in R involve getting external input data to be loaded into the memory of R. You have overcome the biggest hurdle in this endeavor. Your input data in the file demo.txt now resides as a R data structure named "expt1". The rest is easy.

Command

amod <- aov(Y~X,data=expt1)

Output

There is no output when this step is successful. Y is the numerical data observations. X is the categorical variable, of the distinct names of each treatment category. The results of one-way Anova are saved in a R data structure named "amod".

The next command displays the results of the Anova, which may be captured by copy-paste to Excel, subsequently parsed into neat columns by the Excel menu Data-Text to Columns.

Command

summary(amod)

Output

Df Sum Sq Mean Sq F value Pr(>F) X 3 3.244 1.0815 27.59 9.26e-07 *** Residuals 17 0.666 0.0392 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The demo.txt input data is showing in one-way Anova that at least one of the pairs of treatments is significantly different, with extremely low p-value, well below 0.001, suggesting that the next step of Tukey HSD, Scheffe, Bonferroni and Holm methods will almost surely reveal the significantly different pair(s).

Now load the multiple comparison library of R, which was installed in the first step, into the memory of R.

Command

library("multcomp")

Output

Loading required package: mvtnorm Loading required package: survival Loading required package: splines Loading required package: TH.data

Now ask R to conduct Tukey HSD multiple comparison (2 commands here):

Command

tmod <- glht(amod,linfct=mcp(X="Tukey")) summary(tmod)

Output

Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: aov(formula = Y ~ X, data = expt1) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) B - A == 0 0.2265 0.1252 1.809 0.30311 C - A == 0 0.5565 0.1252 4.444 0.00170 ** D - A == 0 1.0188 0.1199 8.499 < 0.001 *** C - B == 0 0.3300 0.1252 2.636 0.07455 . D - B == 0 0.7923 0.1199 6.610 < 0.001 *** D - C == 0 0.4623 0.1199 3.857 0.00614 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- single-step method)

R has produced the p-values of Tukey HSD multiple comparison for all pairs! R has an alternate built-in function that provides the p-value as well as the Tukey HSD range. We repeat the Tukey HSD test with this alternate function which produces slightly different p-values at the 4th place of decimal.

Command

TukeyHSD(amod, conf.level=0.95)

Output

Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Y ~ X, data = expt1) $X diff lwr upr p adj B-A 0.2264850 -0.12942036 0.5823903 0.3032353 C-A 0.5564744 0.20056909 0.9123797 0.0018291 D-A 1.0188116 0.67805817 1.3595650 0.0000009 C-B 0.3299894 -0.02591587 0.6858948 0.0744333 D-B 0.7923266 0.45157321 1.1330800 0.0000242 D-C 0.4623372 0.12158377 0.8030906 0.0062802

The same command is repeated, except for changing the significance level to 0.99:

Command

TukeyHSD(amod, conf.level=0.99)

Output

Tukey multiple comparisons of means 99% family-wise confidence level Fit: aov(formula = Y ~ X, data = expt1) $X diff lwr upr p adj B-A 0.2264850 -0.22856704 0.6815369 0.3032353 C-A 0.5564744 0.10142240 1.0115264 0.0018291 D-A 1.0188116 0.58313245 1.4544907 0.0000009 C-B 0.3299894 -0.12506255 0.7850414 0.0744333 D-B 0.7923266 0.35664749 1.2280057 0.0000242 D-C 0.4623372 0.02665805 0.8980163 0.0062802

Now onwards to Scheffe multiple comparison.

Unfortunately, R does not have a direct procedure or function or parameter for multiple comparison according to the Scheffe method. Fortunately, the R multcomp package procedure that we created to conduct Tukey HSD gives an output for each pair in the column "t-value". We denote this "t value" quantity for each pair as T. The NIST Engineering Statistics Handbook page for Scheffe's method provides a formula which directly leads to the Scheffe p-value corresponding to an observed value of T as:

1 - F [T^2 /(k-1), k-1, ν)] where F[.] is the cumulative F distribution, and where (k-1) and ν are the two degrees of freedom parameters of the F distribution. Note that k is the number of treatments and ν is the degrees of freedom of error that were established earlier. We construct 1 line of R code to capture the array of the "t value" from the Tukey HSD output in the previous step and name it as the array (the T-statistic). One line of R code line of code saves these "t values" into the array T, and one line of R code displays the contents of the array T :

Command

T <- summary(tmod)$test$tstat as.data.frame(T)

Output

T B - A 1.808899 C - A 4.444472 D - A 8.498907 C - B 2.635573 D - B 6.609574 D - C 3.856808

We capture the number of columns k and the degrees of freedom of error ν from the output of Anova which resides in the R memory variable that we named as "amod". Then we compute and display the p-value for each pair according to the Scheffe multiple comparison method. The F distribution is evaluated using the R built-in function named pf() for the F cumulative distribution.

Command

k <- amod$rank v <- amod$df.residual pValScheffe <- 1-pf(T**2/(k-1),k-1,v) as.data.frame(pValScheffe)

Output

pValScheffe B - A 3.798759e-01 C - A 3.742077e-03 D - A 2.368168e-06 C - B 1.122681e-01 D - B 5.962367e-05 D - C 1.185196e-02

We now have the Scheffe multiple comparison p-value for all pairs in exponent format, which may be excused in instances of complex open-source contributed software that is entirely free. You may briefly think exponentially to discern the pairs bearing Scheffe p-value less than 5.0e-02, and further those with Scheffe p-value less than 1.0e-02. Finally, onwards to Bonferroni simultaneous multiple comparison of q pairs. We consider two sets of contrasts:

(1) All pairs (same contrasts as Tukey HSD). Here q=6 for our demo.txt input data.

(2) Only pairs relative to A, assuming A is control. Here q=3 for our demo.txt input data.

Bonferroni and Holm methods in R multcomp procedure require an elaborate setup of all pairs under consideration in the form of a Contrast Matrix, which is constructed below. The entire set of 8 segments of the single line of R code may be pasted into the R prompt.

Now invoke the R procedure to produce Bonferroni simultaneous multiple comparison for all pairs.

Command

bmod <-glht(amod,linfct=mcp(X=contrasts)) summary(bmod,test=adjusted("bonferroni"))

Output

Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: User-defined Contrasts Fit: aov(formula = Y ~ X, data = expt1) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) B - A == 0 0.2265 0.1252 1.809 0.52912 C - A == 0 0.5565 0.1252 4.444 0.00213 ** D - A == 0 1.0188 0.1199 8.499 9.52e-07 *** C - B == 0 0.3300 0.1252 2.636 0.10412 D - B == 0 0.7923 0.1199 6.610 2.65e-05 *** D - C == 0 0.4623 0.1199 3.857 0.00759 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- bonferroni method)

Now invoke the R procedure to produce Holm simultaneous multiple comparison for all pairs.

Command

summary(bmod,test=adjusted("holm"))

Output

Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: User-defined Contrasts Fit: aov(formula = Y ~ X, data = expt1) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) B - A == 0 0.2265 0.1252 1.809 0.08819 . C - A == 0 0.5565 0.1252 4.444 0.00142 ** D - A == 0 1.0188 0.1199 8.499 9.52e-07 *** C - B == 0 0.3300 0.1252 2.636 0.03471 * D - B == 0 0.7923 0.1199 6.610 2.21e-05 *** D - C == 0 0.4623 0.1199 3.857 0.00379 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- holm method)

Now repeat Bonferroni and Holm comparisons for fewer relevant pairs, only pairs relative to column A, assuming that column A represents the control. The Contrast Matrix is redefined as follows and all 5 lines are copy-pasted to R:

Now invoke the R procedure to produce Bonferroni simultaneous multiple comparison for the relevant q=3 pairs of contrasts relative to A only.

Command

bmod <- glht(amod,linfct=mcp(X=contrasts)) summary(bmod,test=adjusted("bonferroni"))

Output

Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: User-defined Contrasts Fit: aov(formula = Y ~ X, data = expt1) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) B - A == 0 0.2265 0.1252 1.809 0.26456 C - A == 0 0.5565 0.1252 4.444 0.00107 ** D - A == 0 1.0188 0.1199 8.499 4.76e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- bonferroni method)

Now invoke the R procedure to produce Holm simultaneous multiple comparison for all pairs.

Command

summary(bmod,test=adjusted("holm"))

Output

Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: User-defined Contrasts Fit: aov(formula = Y ~ X, data = expt1) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) B - A == 0 0.2265 0.1252 1.809 0.088187 . C - A == 0 0.5565 0.1252 4.444 0.000711 *** D - A == 0 1.0188 0.1199 8.499 4.76e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- holm method)

Note that the Bonferroni and Holm p-values are lower for (the same pairs) as the number of contrasts (pairs) under simultaneous comparison is smaller. The Contrast Matrix may be re-defined for any other set of orthogonal contrasts. Done! We have harnessed R, fed the input data to R, and obtained outputs for post-hoc Tukey HSD, Scheffe, Bonferroni and Holm multiple comparison analysis.

Summary: Demonstrated, tested and working R Code for conducting Tukey HSD, Scheffe, Bonferroni and Holm multiple comparison

We have combined the above lines of R code and put them together below in the same correct sequence. As long as you have opened R with right-mouse-click selection "Run as Administrator" and changed the R directory to the relevant directory where your input file demo.txt is saved, you may copy-paste this entire set of R code lines at the R prompt, which will execute them sequentially. Please feel free to rename any of the variables, as long as the new name does not contain blank spaces. The treatments in the demo are generically named A, B, C and D. You may name your tretments anything -- just don't keep blank spaces in the treatment names. It is suggested to use the underscore character if you need something that serves the purpose of a blank space within a R data name.

install.packages("multcomp") expt1 <- read.table("demo.txt",header=T) amod <- aov(Y~X,data=expt1) summary(amod) library("multcomp") tmod <- glht(amod,linfct=mcp(X="Tukey")) summary(tmod) TukeyHSD(amod, conf.level=0.95) TukeyHSD(amod, conf.level=0.99) T <- summary(tmod)$test$tstat as.data.frame(T) k <- amod$rank v <- amod$df.residual pValScheffe <- 1-pf(T**2/(k-1),k-1,v) as.data.frame(pValScheffe) contrasts <- rbind( "B - A" = c(-1,1,0,0), "C - A" = c(-1,0,1,0), "D - A" = c(-1,0,0,1), "C - B" = c(0,-1,1,0), "D - B" = c(0,-1,0,1), "D - C" = c(0,0,-1,1) ) contrasts bmod <- glht(amod,linfct=mcp(X=contrasts)) summary(bmod,test=adjusted("bonferroni")) summary(bmod,test=adjusted("holm")) contrasts <- rbind( "B - A" = c(-1,1,0,0), "C - A" = c(-1,0,1,0), "D - A" = c(-1,0,0,1) ) bmod <- glht(amod,linfct=mcp(X=contrasts)) summary(bmod,test=adjusted("bonferroni")) summary(bmod,test=adjusted("holm"))

Attribution

e-mail: navendu <dot> vasavada <at> alumni <dot> upenn <dot> edu

Page updated

Google Sites

Report abuse