We will use the heart attack data as an example. If you have not worked with this data before you can find a description here. This is a very large data set and so is provided as zip files. (You may need a program such as winzip to unzip them). Available are plain text (with tabs separating entries) and Excel versions of the data. Download the text data file and unzip it. If using RStudio, follow the directions on my LoadData video tutorials, the summary is: From RStudio, upper right window, Workspace tab
> Import Dataset > From text file, and follow directions
When I did this using the .txt file, I got the following error:
The error reads, "more column names than row names." OK, When I counted the headings, there are eight ("LOS" is separated from AGE" by a tab charater, if it was not, this would be a problem), and there are eight columns of data -- so what gives? The problem is the first heading, "Patient #". In R, the number symbol, "#" sets off comments, and so when it tries to read in first row, it automatically takes everything after "#" as a comment. A comment is a non-coding remark placed by the programmer/user for personal reference, ignored by the computer. To fix it, open the original file, delete the "#" symbol, save it in a location you can find, and repeat the import process.
If using R alone, Hayden's Reading Tables Into R page explains how to do it using the read.table() command. If you remembered this dataset from a previous tutorial and used the Web URL import method, then the data.frame may have read in as heartatk4R. For this tutorial, we will use the name heartatk, which can be obtained by,
> heartatk = heartatk4R
Don't forget to attach your data.frame,
> attach(heartatk)
R prefers to work with the counts rather than the raw data. If you do not have counts, but you have the data in variables in R, you can use the table() command to get the counts. This also checks for some types of gross errors, such as an 11 in a column that is supposed to be 0-1. You have to reattach hte heart attack data each time you open R.
> names(heartatk) > attach(heartatk)
You can make a cross-tabulation of two categorical variables with table() and do a test of independence or homogeneity with chisq.test().
> sex.died = table(SEX,DIED) > sex.died > chisq.test(sex.died) > chisq.test(sex.died, correct=TRUE)
For both of the above tests, p-value = 2.2e-16, which is 0.00000000000000022, which is super-small! The argument "correct=FALSE" tells the chisq.test() command to not use a continuity correction, the default which gives a slightly more accurate approximation to the chi-square distribution. This is equivalent to the result you would get if you did it by hand.
With 12,844 observations, getting the table is a lot more work than computing chi-square, and it is best to let the computer do it. If you have an existing table, R will analyze it -- but not without putting up a fight. You need to enter the table one row at a time and use rbind to combine the rows into a table. Here we enter a table that looks like.
17
8
22
35
53
491
without totals.
> hep=rbind(c(17,35),c(8,53),c(22,491)) > hep > chisq.test(hep)
The warning message probably refers to a small expected count in one cell. We can check that using a standard R trick. Many R procedures compute far more than they report. You can save all that is computed to a file and then extract what you want. This allows the output of one procedure to be used as the input to another -- a very powerful feature if you program in R.
> out = chisq.test(hep) Warning message: Chi-squared approximation may be incorrect in: chisq.test(hep) > out$expected
Here is a complete list of things you could type after the $, obtained by typing
> names(out)
The definition of each item in the output list is always given in the help file, e.g.
> ?chisq.test
Here the syntax for obtaining the individual items, by using the "$" between the object and its item.
> out$statistic
> out$parameter
> out$p.value
> out$method
> out$data.name
> out$observed
> out$expected
Here we see that we have two cells with small expected counts. This means that the assumption of the test was not met (more than 80% of expected values >5 and none <1). The test is therefore invalid. To fix it, we need to combine categories.
> hep2=rbind(c(25,88),c(22,491)) > hep2 > out2=chisq.test(hep2) > out2 > out2$expected
This did the trick.
> detach(heartatk)
> rm(hep, out, sex.died, hep2, out2) #Clean up workspace
Exercises
1. Is there a relationship between the region a state is in and its political party? Use chisq.test() to find out at the 5% level of significance. Confirm that the assumptions of the test are met.
2. Run table(region,party). Combine the regions "South", "North Central", and "West" and run chisq.test() again at 5%. Do the results change? Why?