We will use the heart attack data as an example. If you have not worked with this data before you can find a description here. This is a very large data set, so it is provided as zip files. (You may need a program such as WinZip to unzip them.) Plain text (with tabs separating entries) and Excel versions of the data are available. If possible, download and unzip the text version of the heart attack data and open it in R. A separate page gives details on reading text files into R using this data as an example.
R prefers to work with the counts rather than the raw data. If you do not have the counts, but you have the data in variables in R, you can use the R table command to get the counts. The table also lets you spot some types of gross errors, such as an 11 in a column that is supposed to be 0-1. You have to reattach the heart attack data each time you open R; otherwise R cannot find the variables, as the first command here shows.
> table(SEX,DIED)
Error in table(SEX, DIED) : object "SEX" not found
> attach(heartatk)
> table(SEX,DIED)
   DIED
SEX    0    1
  F 4298  767
  M 7136  643
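If you would rather not attach the data at all, one possible alternative is to tell R which data frame to look in. The sketch below assumes nothing about the real file: `toy` is a hypothetical stand-in that we rebuild from the counts printed above, holding only the two columns used here.

```r
# Hypothetical stand-in for the heart attack data, rebuilt from the
# counts printed above (the real data frame has more columns).
toy <- data.frame(
  SEX  = rep(c("F", "M"), times = c(5065, 7779)),
  DIED = c(rep(c(0, 1), times = c(4298, 767)),
           rep(c(0, 1), times = c(7136, 643)))
)

# with() looks up SEX and DIED inside the data frame, so no attach() is needed.
tab <- with(toy, table(SEX, DIED))
print(tab)

# Equivalent: name the data frame explicitly with $.
tab2 <- table(toy$SEX, toy$DIED)
```

With the real data you would put `heartatk` in place of `toy`. Either form keeps working after you restart R, as long as the data frame itself has been read in.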
R is being mean and not returning the row and column totals. We can get those with repeated use of table and length.
> table(SEX)
SEX
   F    M
5065 7779
> table(DIED)
DIED
    0     1
11434  1410
> length(SEX)
[1] 12844
It would make a good exercise to put these totals in their proper place in the original table. Page layout is no guide to what is statistically correct!
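Once you have tried the exercise by hand, you may want to check your answer. A minimal sketch, assuming you retype the four counts into a matrix (the name `counts` is ours, not something in the data), uses R's `addmargins` function to append the totals:

```r
# The four cell counts from the SEX by DIED table, typed in by hand.
counts <- matrix(c(4298, 767,
                   7136, 643),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(SEX = c("F", "M"), DIED = c("0", "1")))

# addmargins() appends a "Sum" row and a "Sum" column to the table.
with_totals <- addmargins(as.table(counts))
print(with_totals)
```

The "Sum" row and column should match the totals from the separate table and length commands.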
We will compare the mortality rates of males and females. This amounts to labeling death as "success". We need the numbers who died in each group for x and the total number of people in each group (total males and total females) for n. Make sure you enter these in consistent order. We are very old and followed the old-fashioned rule of "ladies first" -- for both x and n.
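One way to keep x and n in a consistent order is to pull both from the same table of counts instead of typing them twice. This is only a sketch: it assumes the counts have been stored in a matrix we call `counts`, with females in the first row.

```r
# Cell counts with rows F, M and columns 0 (survived), 1 (died).
counts <- matrix(c(4298, 767,
                   7136, 643),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(SEX = c("F", "M"), DIED = c("0", "1")))

x <- counts[, "1"]    # number who died in each group: F first, then M
n <- rowSums(counts)  # group sizes, taken from the same rows in the same order

# x and n now line up by construction.
result <- prop.test(x = x, n = n)
print(result)
```

Because both vectors come from the rows of one matrix, a female count can never be paired with a male total.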
> prop.test(x=c(767,643),n=c(5065,7779))

        2-sample test for equality of proportions with continuity correction

data:  c(767, 643) out of c(5065, 7779)
X-squared = 147.7612, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.05699518 0.08055073
sample estimates:
    prop 1     prop 2
0.15143139 0.08265844
"Ladies first" means we computed F minus M, so the positive numbers in the confidence interval mean the mortality rate was higher for women. The fact that the interval does not include zero means that the difference is unlikely to be due to sampling error, and the tiny p-value confirms this. (Bear in mind that this is not a random sample, so we need to be very cautious in extrapolating to other states or years.) Note that you get both a confidence interval and a hypothesis test with one command. The results may not agree with examples in textbooks because R applies a continuity correction (as the first line of its output says), a tweak that is not worth the trouble if you are doing the calculations by hand.
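To see where a textbook answer would come from, here is a rough sketch of the usual hand calculation for the 95 percent interval, set beside what R gives when the continuity correction is switched off with `correct = FALSE`. The variable names are ours.

```r
x <- c(767, 643)     # deaths: F, M
n <- c(5065, 7779)   # group sizes: F, M

p  <- x / n                        # sample proportions
d  <- p[1] - p[2]                  # difference, F minus M
se <- sqrt(sum(p * (1 - p) / n))   # unpooled standard error of the difference
z  <- qnorm(0.975)                 # normal critical value for 95 percent

ci_hand <- d + c(-1, 1) * z * se   # textbook interval, no correction

# The same interval from R with the continuity correction turned off.
ci_r <- prop.test(x, n, correct = FALSE)$conf.int

# R's default interval, which includes the correction.
ci_default <- prop.test(x, n)$conf.int

print(ci_hand)
print(ci_r)
print(ci_default)
```

The first two intervals should agree, and the default one should be slightly wider at both ends. With samples this large the correction hardly matters.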
Exercises