This tutorial uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at Hayden's site as a plain text file or as an Excel spreadsheet.
Two ways to load the data are:
1. Get the URL of the PULSE datafile on your clipboard (on my machine I right-click the link and select “copy link address”). The URL will be: http://statland.org/Software_Help/R/pulse.txt
Then, from RStudio, upper right window, Workspace tab
> Import Dataset > From Web URL, paste URL and follow directions
2. Download the text data file. If necessary, follow the directions in my LoadData video tutorials. In summary: from RStudio, upper right window, Workspace tab
> Import Dataset > From text file, and follow directions
> attach(PULSE)
> names(PULSE)
You can get summary statistics for each of your categorical variables by using the summary command. The variables R recognizes as categorical (because they are text) are Ran., Smokes., and Sex. (The periods at the end of Ran. and Smokes. are there because the source file has question marks in those locations, and question marks are illegal in R variable names.)
> summary(Ran.)
> summary(Smokes.)
> summary(Sex)
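One caveat, offered as a hedged sketch: since R 4.0.0, text columns are imported as character rather than factor by default, and summary of a character vector reports only its length, class and mode. In that case table, or summary applied to a factor, gives the counts. The vector below is rebuilt from the smoking counts reported in this tutorial (64 "no", 28 "yes"); its name and the order of the responses are made up for illustration.

```r
# Rebuild the Smokes. column from the counts in this tutorial;
# the order of the individual responses is arbitrary.
smokes_chr <- rep(c("no", "yes"), times = c(64, 28))

# On a character vector, summary() gives no category counts
summary(smokes_chr)

# table() counts each category regardless of the storage type
smokes_counts <- table(smokes_chr)
smokes_counts

# Converting to a factor makes summary() report the same counts
summary(factor(smokes_chr))
```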
A table is usually the best summary for categorical data. Once we have a table we should look at it and say something sensible. A very short verbal summary here is that most of these students are men, most of them did not run, and most of them do not smoke. We should also note anything unusual or unexpected in the data. Here we might wonder about the imbalance between the sexes. At least in the United States, college students have been roughly evenly balanced between the sexes for decades, although in recent times a trend of higher proportion of females has emerged. Why such a preponderance of males here? Is this a course for engineering majors? Were the data gathered many years ago, or in a country where fewer women go to college? These are the kinds of things a good analyst looks for and questions. We might have similar questions about how few ran. Was this really decided by a fair coin toss, or did we have some non-compliance for this "treatment"?
Two-Way Tables
Now let's look at the relationship between the two categorical variables Sex and Smokes.
> table(Sex, Smokes.)
Such tables usually include row and column totals. To get those we use a powerful R idea: save the results of a procedure and pass it to another.
> tab = table(Sex, Smokes.)
> addmargins(tab)
Of course there is no point in getting this table unless we can interpret it. One thing we might be interested in is whether there is a difference in the prevalence of smoking between the two sexes. 8 out of 35 females smoke while 20 out of 57 males smoke. Those are hard to compare unless we change to a common denominator, or express them as proportions or percents.
> 8/35
> 20/57
We see that about 23% of the females smoke and about 35% of the males, so smoking is more common among males in this group of students. R can do the arithmetic for you.
> prop.table(tab,1)
The "1" tells R to compute proportions within each row, so each row sums to 1 and we can compare the sexes. To compare smokers to non-smokers, compute column percents.
> prop.table(tab,2)
Note that in R you can use the up-arrow key to recall previous commands, so the latest command could be created by using the up-arrow once and changing the 1 to a 2.
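If you do not have the PULSE data loaded, the row and column percents can still be reproduced from the four counts in the table. This is a minimal sketch: the table is typed in by hand from the counts shown in this tutorial, and the names row_prop and col_prop are only for illustration.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

# Row percents: each count divided by its row total (35 or 57)
row_prop <- prop.table(tab, 1)
round(row_prop, 3)

# Column percents: each count divided by its column total (64 or 28)
col_prop <- prop.table(tab, 2)
round(col_prop, 3)

# Sanity check: rows of row_prop and columns of col_prop each sum to 1
rowSums(row_prop)
colSums(col_prop)
```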
Now let's look at a very trivial issue that we discuss only because it often leads to confusion for beginners. Depending on how we select our variables in a two-way table, we can get different looking tables.
> table(Smokes., Sex)
       Sex
Smokes. female male
    no      27   37
    yes      8   20
Here the rows and columns are interchanged compared to our original table. There is no right (or wrong) way to do this! The convention is that the explanatory variable is the rows and the response is the columns. However, sometimes there is no clear explanatory or response variable, or the choice is determined by non-statistical issues like fitting the table on a page or overhead. The only reason it is worth mentioning is to warn you not to memorize any rules for working with tables that include the words "row" or "column", since the same information could be in either a row or a column, depending on how the table is laid out.
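To see that the layout does not change the information, here is a small sketch that flips a hand-typed copy of the table with t() and checks that row percents in one layout match column percents in the other. The counts come from this tutorial; the names flipped and same are hypothetical.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

# t() swaps rows and columns: Smokes. becomes the rows, Sex the columns
flipped <- t(tab)
flipped

# Row percents of the original layout ...
by_sex_original <- prop.table(tab, 1)
# ... are the column percents of the flipped layout, transposed back
by_sex_flipped <- t(prop.table(flipped, 2))

# Compare the numbers cell by cell
same <- isTRUE(all.equal(as.vector(by_sex_original),
                         as.vector(by_sex_flipped)))
same
```

The question "compare males to females" gives the same numbers either way; only the margin argument changes with the layout.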
Probability and Two-Way Tables
In looking at a table we can think in terms of counts or of proportions, such as 28 out of 92 smoke. We can also think of the latter as a probability. If we pick a person at random from this group, the probability that they smoke is 28 out of 92. In some cases, this is all we want to know. In other cases, this might be an estimate of some other probability or proportion -- perhaps we have a sample value and want to look at a larger population. In what follows, we will talk mainly in terms of probabilities. We will also try to match up what we can get from the table with probability terminology and notation.
From any of our tables, we can see that the probability that a person selected at random smokes is 28/92 ≈ 0.30, or 30%. The probability that they are male is 57 out of 92, or 62%. Simple probabilities like these come from the totals in the Sum row and column. The probability that a person does not smoke can be found as 64/92 or by the complement rule as 1-(the probability that they smoke) or 1-(28/92). Both approaches should give 70%. (In theoretical work and in doing arithmetic we usually use the proportion 0.7, but when interpreting results most people prefer percentages.) It is important to recognize how these rules play out in tables, because this sort of data is almost always presented in tables!
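The simple probabilities and the complement rule can be checked with a few lines of R. This is a sketch using a hand-typed copy of the table from this tutorial; the variable names are only for illustration.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

n <- sum(tab)                          # 92 people in all

p_smokes <- sum(tab[, "yes"]) / n      # 28/92, from the "yes" column total
p_male   <- sum(tab["male", ]) / n     # 57/92, from the "male" row total

# Complement rule: P(does not smoke) = 1 - P(smokes)
p_not_smokes <- 1 - p_smokes

# The same value read directly from the "no" column total, 64/92
p_not_smokes_direct <- sum(tab[, "no"]) / n

round(c(p_smokes, p_male, p_not_smokes, p_not_smokes_direct), 3)
```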
There is also a rule for probabilities with "and", but it works only for independent events. These are important in theory but rare in practice ;-) In practice we have to count. From the table, we can see that there were 8 people who were female AND smoked. Hence the correct probability for this is 8/92 = 8.7%. The independent-event formula would give (28/92)*(35/92) = 11.6% -- close, but not really close. Probabilities with AND are generally found with a total percents table.
> prop.table(tab)
        Smokes.
Sex              no        yes
  female 0.29347826 0.08695652
  male   0.40217391 0.21739130
> addmargins(prop.table(tab))
        Smokes.
Sex              no        yes        Sum
  female 0.29347826 0.08695652 0.38043478
  male   0.40217391 0.21739130 0.61956522
  Sum    0.69565217 0.30434783 1.00000000
The probability of being a female smoker (which we calculated a moment ago) is the female/yes entry, 0.08695652, in the table immediately above. The probability of being a male (and a) non-smoker is about 40.2%.
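The comparison between the counted AND probability and the independent-event formula can be reproduced as follows. This is a sketch on a hand-typed copy of the table from this tutorial; the names joint, p_female_and_smokes and p_if_independent are hypothetical.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

# Total percents table: every count divided by the grand total
joint <- prop.table(tab)

# Counted probability of female AND smokes: 8/92
p_female_and_smokes <- joint["female", "yes"]

# What the independent-event formula would give: P(female) * P(smokes)
p_female <- sum(joint["female", ])     # 35/92
p_smokes <- sum(joint[, "yes"])        # 28/92
p_if_independent <- p_female * p_smokes

round(c(counted = p_female_and_smokes,
        if_independent = p_if_independent), 3)
```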
For display, or easier reading, you can show only the first two or three decimal places of R tables:
> round(prop.table(tab),2)
> round(prop.table(tab),3)
Disjoint (also called "mutually exclusive") events connect with tables in two ways. First, when you set up each categorical variable, the categories should be disjoint. People should have just one activity level, and either they ran or they did not. If you open a well-constructed data file, this should already be taken care of. You may have to be more careful if you set up a data file yourself. For example, you may have a survey question that asks people to check a list of hobbies they have. Since people may have more than one hobby, your hobbies may not form disjoint sets. The standard way to deal with this is to represent each hobby choice with a yes-no variable.
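As a sketch of the yes-no approach for a "check all that apply" question, the code below turns made-up survey answers into one yes/no variable per hobby. Everything here (the answers, the hobby names, the data frame name survey) is hypothetical and not part of the PULSE data.

```r
# Hypothetical answers: each respondent may list several hobbies, or none
answers <- list(c("reading", "chess"),
                "running",
                c("chess", "running", "reading"),
                character(0))

hobbies <- c("reading", "chess", "running")

# One logical column per hobby: did this respondent check it?
indicators <- sapply(hobbies, function(h) {
  vapply(answers, function(a) h %in% a, logical(1))
})

# Recode TRUE/FALSE as yes/no so each hobby is a two-category variable
survey <- as.data.frame(ifelse(indicators, "yes", "no"))
survey

# Each yes-no variable now has disjoint categories and can be tabulated
table(survey$chess)
```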
We may also see disjointness between certain values of different variables. For example, if we are studying the prevalence of various forms of cancer and comparing males and females, we will find no males with ovarian cancer and no females with prostate cancer. These are two examples of disjoint events and we would see two 0's in the contingency table. On the other hand, when we see 0's, we always wonder if there is some reason (biological in our example) why the events are disjoint, or is the 0 just a peculiarity of this set of observations.
Conditional probabilities are computed in row percent and column percent tables. In fact, the meaning of conditional probabilities is much clearer in tables than it is in language or mathematical notation. The idea of a conditional probability is that you are looking at a subset of the data. For example, in an election poll we might be interested in the proportion of voters who prefer Candidate A, and also be interested in what that proportion is among certain subsets, such as men, women or blacks. For the pulse data, we saw that about 30% of the 92 people smoked. However, for the subgroup of females, only 8 out of 35 or about 23% smoke. Often we want to compare one subset to another. Here 20/57 males or about 35% smoke. We noted this earlier and found those numbers in the table. The notation for these conditional probabilities might look something like P(smokes | female) and P(smokes | male) respectively. These are row percents because probabilities are computed with the row totals as denominators. The subgroups are males and females.
> addmargins(prop.table(tab,1))
(Don't take that last row too seriously.) We can also compare smokers to non-smokers.
> addmargins(prop.table(tab,2))
(Don't take that last column too seriously.) 71% of the smokers were male. The notation for this conditional probability might look something like P(male | smokes). It's not the same as P(smokes | male) = 35%. Now the subgroups are smokers and non-smokers. Recall that what is a row and what is a column is arbitrary, so in practice you have to ask yourself, "Do I want to compare males to females or smokers to non-smokers?" and not "Do I want row percents or column percents?" Putting these into words may help to see the difference and how these arise in practice. P(male | smokes) is about the 28 people who smoke. Of those 28, what proportion were male? P(smokes | male) is about the 57 males. Of those 57, what proportion smoke?
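The two conditional probabilities can also be computed by hand, which makes the different denominators explicit. This is a sketch on a hand-typed copy of the table from this tutorial; the variable names are only for illustration.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

male_smokers <- tab["male", "yes"]     # 20 people are male AND smoke

# P(smokes | male): restrict to the 57 males
p_smokes_given_male <- male_smokers / sum(tab["male", ])

# P(male | smokes): restrict to the 28 smokers
p_male_given_smokes <- male_smokers / sum(tab[, "yes"])

round(c(p_smokes_given_male, p_male_given_smokes), 3)
```

The numerator is the same in both cases; only the subgroup used as the denominator changes.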
Independence is closely related to conditional probabilities. If gender and smoking were independent, then a column percents table might look like this:
          no    yes   Rowtotal
female   0.35   0.35    0.35
male     0.65   0.65    0.65
with the percent of females the same for smokers and non-smokers and for the group as a whole. "Independence" can be a tricky word in ordinary English, and is even more so in statistics. Independence in the table above means that the proportion of females is the same for both smokers and non-smokers. But smoking and gender are dependent in the sense that if I know the percentage of smokers who are female, and I know the two are independent, then I know the percentage of non-smokers who are female. Ironically, statistical independence puts very tight restrictions on what a two-way table can look like. Rarely do we see complete independence in real data, and often the question is how close we come to independence. Here the percentages of females among smokers and among non-smokers are in the same ballpark but not really close (28.6% versus 42.2%).
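Here is a sketch of that comparison in R, using a hand-typed copy of the table from this tutorial. It pulls out the share of females among non-smokers, among smokers, and overall, then builds the column percents table we would see if the two variables were exactly independent with these row totals. The names female_share and indep_col_pct are hypothetical.

```r
# Counts copied from the tutorial's Sex by Smokes. table
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

# Share of females among non-smokers, among smokers, and in the whole group
col_pct <- prop.table(tab, 2)
female_share <- c(col_pct["female", ],
                  overall = sum(tab["female", ]) / sum(tab))
round(female_share, 3)

# Under exact independence every column would repeat the overall shares
row_share <- rowSums(tab) / sum(tab)
indep_col_pct <- cbind(no = row_share, yes = row_share, Rowtotal = row_share)
round(indep_col_pct, 3)
```

The farther the observed column percents sit from the repeated overall shares, the farther the data are from independence.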
> detach(PULSE)
Exercises
These exercises use the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.
1. Construct a two-way table of the two categorical variables.
2. Run the command below to produce a polished double bar plot.
> barplot(table(party,region),beside=TRUE,legend.text=c("Democrat","Republican"), col=c("blue","red"), xlab="Region",main="Number of States by Party in 96 Pres. Election")
3. Run the command for question 2 again, but change the first argument to "table(region,party)". What happened?
4. To improve your graph from question 3, add two additional colors of your choice. To see a list of available colors, type
> colors()
If you don't care, you may use numbers, e.g. "col=1:4" or "col=2:5", to specify default colors.
5. Is there an association between the categories? Explain why. What do you think it means? (We'll learn later how to test for the presence of non-random associations such as these.)