When categorical data appear in textbooks, they are usually already summarized in tables or graphs, so you rarely need technology to do homework problems with categorical data. However, this leaves you underprepared for dealing with real data, and this tutorial is for those who need to do that. We will use an example dataset small enough that you can do the calculations by hand and compare your results to the computer's. Imagine a survey question with answer choices Agree, Disagree, or Undecided. Suppose 25 people give these responses:
A,A,D,U,D,D,A,U,A,D,A,D,D,A,U,A,U,D,D,A,A,A,U,D,A.
Mode
Most software will not report the mode, because the mode is rarely useful for measurements. To find it when you do need it, you have to treat the data as categorical. For categorical data, the modal category is the one with the most observations (if there is such a category). You can see by counting that the list above has more A's (11) than D's (9) or U's (5), so A is the modal category. This is the shortest summary for categorical data, analogous to giving just the mean or median for measurements. When we find the modal category for a group of measurements, it is called the mode. It is useful only when the measurements resemble categorical data in having values that are repeated over and over. An example might be the number of children in a family, where you might see 0, 1, 2, ... over and over. For more typical measurements, such as these
1.66597, 1.91566, 2.53406, 2.88043, 2.93449, 3.08816, 1.73520, 3.21908, 3.77892, 3.98208
the mode is not useful because there is none. No value is repeated.
If you need the mode, make a histogram (or frequency table) for the data and find the category with the most observations.
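As a minimal sketch of that advice in R, the lines below build a frequency table and pick out the value or values with the largest count. The first vector holds the ten measurements listed above; the children vector is made up purely for illustration, and the names x, counts, children, tab and modal are arbitrary choices, not anything R requires.

```r
# The ten measurements from the text: every value is distinct
x <- c(1.66597, 1.91566, 2.53406, 2.88043, 2.93449,
       3.08816, 1.73520, 3.21908, 3.77892, 3.98208)
counts <- table(x)
max(counts)        # largest count is 1, so no value repeats and there is no mode

# Hypothetical numbers of children in ten families (invented for this example)
children <- c(0, 2, 1, 2, 3, 2, 0, 1, 2, 1)
tab <- table(children)
tab

# Keep every value whose count equals the largest count (handles ties too)
modal <- names(tab)[tab == max(tab)]
modal
```

Because table() labels its counts with text, the result comes back as a character string; wrap it in as.numeric() if you need a number.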
Tables
Start R. Use quotation marks to enter the responses as text.
> survey = c("A","A","D","U","D","D","A","U","A","D","A","D","D","A","U","A","U","D","D","A","A","A","U","D","A")
> survey
> table(survey)
The table shows 11 A's, 9 D's, and 5 U's, so the modal category is "A" (Agree).
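If you would rather have R pick out the modal category and report proportions than read them off by eye, something like the following should work. It is only a sketch: the names counts, modal and props are arbitrary, and the survey vector is re-entered so the lines can be run on their own.

```r
survey <- c("A","A","D","U","D","D","A","U","A","D","A","D","D",
            "A","U","A","U","D","D","A","A","A","U","D","A")

counts <- table(survey)     # how many times each response occurs
counts

# Category (or categories, if tied) with the largest count
modal <- names(counts)[counts == max(counts)]
modal

# Proportion of the 25 respondents giving each response
props <- prop.table(counts)
props
```

You can check the proportions by hand: 11/25, 9/25 and 5/25.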
Bar Graphs and Pie Charts
Graphs have to be made from the counts in a table like the one above, not from the letters stored in the variable itself.
> freq = table(survey)
> freq
> survey
> barplot(freq)
> barplot(freq, main="Barplot of Survey Responses")
> pie(table(survey))
Notice that it is obvious from the bar graph that A is the modal category. It takes sharp eyes to see this in the pie chart. The summaries above are in order of decreasing statistical quality. A table gives the most information, and the most precise information, in the least amount of space; a pie chart gives the least. Use a bar graph or pie chart instead of a table when a visual display is needed, e.g. in a presentation to a non-technical audience with time constraints, where a graph is easier to display and lets you make your point quickly and move on. In a report, on the other hand, the table may be more effective. The only reason to choose the pie chart over the bar graph is to emphasize the proportions of the whole.
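Both graphs can be made easier to read with a few optional arguments. The sketch below labels the axes of the bar graph, orders the bars from most to least frequent, and prints percentages on the pie slices. The label text and the names pct and pie_labels are just one possible choice, and the survey vector is re-entered so the lines can be run on their own.

```r
survey <- c("A","A","D","U","D","D","A","U","A","D","A","D","D",
            "A","U","A","U","D","D","A","A","A","U","D","A")
freq <- table(survey)

# Bar graph with labelled axes, tallest bar first
barplot(sort(freq, decreasing = TRUE),
        xlab = "Response", ylab = "Number of respondents",
        main = "Barplot of Survey Responses")

# Pie chart with the percentage of the whole written on each slice
pct <- round(100 * as.vector(prop.table(freq)))
pie_labels <- paste0(names(freq), " (", pct, "%)")
pie(freq, labels = pie_labels, main = "Pie Chart of Survey Responses")
```

Sorting the bars makes the modal category the first bar, and the percentage labels supply the precision that the pie chart alone lacks.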
> rm(survey, freq) #Clean up workspace
Exercises
These exercises use the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.
1. Construct a table for each of the two categorical variables.
2. Construct a barplot for each of the two categorical variables, and label the axes.
3. Which is the better display, a table or a barplot? Answer for each variable.