Comparing Two Categories
The discussion of the analysis of univariate data has been concerned primarily with describing sets of observations. Most of what we do with univariate data analysis is to develop a set of values that describe the general characteristics of the distribution of the data. For example, we may have been fishing in a lake and want to describe the weight of the fish that we caught. We would calculate both the mean weight (to indicate the central tendency of our data) and the variance in the weights. As part of this procedure, we generally use a statistical test for the normality of our weight data just to make sure that the descriptive values can be applied to our data.
In previous exercises, you have seen how a classification variable can be used to divide a set of data into groups. These groups can then be shown separately, each as a frequency distribution. This provides a way to compare the groups visually. In the fishing example, we might want to use a classification variable containing the type of fish bait that was used and then generate frequency distributions of the weights, with one plot for each type of bait.
We are now going to look at a way to decide whether groups of data are different. For example, we want to be able to ask whether the fish caught with one type of bait are larger than those caught with another bait. In this example, we will consider a relatively simple situation: a comparison between just two groups (two different baits). In the next chapter we will extend our comparisons to more than two groups.
Data Matrix Characteristics
Two variables are used when comparing two categories. One variable has the measurement values and the other has the categorization values. A typical data matrix is:
OBS    WEIGHT    BAIT
---    ------    ----
  1     112.3    LIVE
  2     144.2    LIVE
  3      98.1    FLY
  4     139.8    LIVE
  5     106.4    FLY
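As a minimal sketch, a data matrix like this one could be entered with a DATA step. The data set name FISH is an assumption made for illustration; the variable names and values are those of the data matrix above.

```sas
/* FISH is a hypothetical data set name; WEIGHT and BAIT come from the data matrix */
DATA FISH;
   INPUT WEIGHT BAIT $;   /* the $ marks BAIT as a character variable */
   CARDS;
112.3 LIVE
144.2 LIVE
98.1 FLY
139.8 LIVE
106.4 FLY
;
RUN;

/* List the data matrix; PROC PRINT supplies the OBS column itself */
PROC PRINT DATA=FISH;
RUN;
```

Note that OBS is not read in as a variable. It is the observation number that SAS assigns to each row.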
There are several important characteristics of this sort of data matrix that are necessary in order to use the statistical tests described in this chapter.
There must be only two values for the categorization variable. In this example, the categorization variable is BAIT and there are only two values, LIVE and FLY. If there are three or more values, the problem must be handled differently: see the next chapter on Comparing Multiple Categories (page 153).
If the data matrix is divided into two parts based on the values of the categorization variable, the values in each of these parts should be normally distributed. In this example, it would be necessary to subset the data matrix with the value of BAIT and perform separate normality tests using PROC UNIVARIATE. In practice, while normality is very desirable, the statistical test that is used is known for being "robust," meaning that it will likely provide an accurate answer even if there is some considerable departure from normality.
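One way to run the separate normality tests is BY-group processing, sketched here on the assumption that the data are in a data set named FISH (a hypothetical name). The data must be sorted by BAIT before the BY statement can be used, and the NORMAL option requests the test of normality.

```sas
/* Sort so that all observations for each bait are together */
PROC SORT DATA=FISH;
   BY BAIT;
RUN;

/* One set of descriptive statistics and one normality test per value of BAIT */
PROC UNIVARIATE DATA=FISH NORMAL;
   BY BAIT;
   VAR WEIGHT;
RUN;
```

This produces one block of UNIVARIATE output for each value of BAIT, so the normality of the LIVE and FLY weights can be judged separately.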
The statistical procedure for doing the comparisons is found in PROC TTEST. This is the SAS procedure for performing Student's t-test on unpaired data. Unpaired data simply mean that there is no particular relationship between the individual values in the two sets. In contrast, paired data consist of measurements like those taken before and after an experimental treatment on an individual.
Before examining how TTEST is run, it is useful to examine several sets of example data to develop a general visual model of how two-category comparisons can be made.
The comparisons are performed on the descriptions of each of the frequency distributions. Therefore, you can picture the comparison being made between two normal curves. Each normal curve can be drawn if you know its mean and variance.
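The curve in question is the normal density. In standard notation (the symbols are introduced here for illustration and are not used elsewhere in this chapter), with mean \mu and variance \sigma^2, the height of the curve at a weight x is:

```latex
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
       \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)
```

The mean sets where the peak falls, and the variance sets how wide the curve is.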
Consider three situations.
A. If you plot two normal distributions that are almost the same (i.e., they have similar means and variances), you would see that their "bell-shaped" curves are almost on top of each other.
B. If you plot two distributions that have similar variances but different means, then their curves would have the same width but the peaks would be offset from each other.
C. If you plot two distributions that have similar mean values, but different variances, then their curves would peak at about the same place but one would be wider than the other.
These three situations are shown in the following figure. The "L" and "F" refer to the two curves being compared; these are the "LIVE" and "FLY" measurements as shown in the data matrix on page 145.
[Figure: three panels, labelled case A, case B, and case C, each plotting the L and F curves against WEIGHT on a horizontal axis running from 0 to 200.]
In the first case (A), there is "no difference" between the curves: both the peaks and the widths are very similar.
In the other two cases (B & C), the curves are different, but for different reasons: in case B they differ because of their means, and in case C they differ because of their variances.
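Each kind of difference is measured by its own statistic. As a sketch in standard notation (all symbols are introduced here for illustration), let n, \bar{x}, and s^2 be the sample size, mean, and variance of each group, with subscripts L and F for the LIVE and FLY groups. The variances are compared with a ratio of the larger to the smaller, and the means are compared with one of two t statistics, depending on whether the variances are treated as equal:

```latex
\begin{aligned}
F' &= \frac{\max(s_L^{2},\, s_F^{2})}{\min(s_L^{2},\, s_F^{2})} \\[6pt]
t_{\text{equal}} &= \frac{\bar{x}_L - \bar{x}_F}
                        {\sqrt{s_p^{2}\left(\dfrac{1}{n_L} + \dfrac{1}{n_F}\right)}},
\qquad
s_p^{2} = \frac{(n_L - 1)\,s_L^{2} + (n_F - 1)\,s_F^{2}}{n_L + n_F - 2} \\[6pt]
t_{\text{unequal}} &= \frac{\bar{x}_L - \bar{x}_F}
                          {\sqrt{\dfrac{s_L^{2}}{n_L} + \dfrac{s_F^{2}}{n_F}}}
\end{aligned}
```

The equal-variance t has n_L + n_F - 2 degrees of freedom. The unequal-variance t uses an approximate number of degrees of freedom.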
This leads us to the way that we can test for differences between two distributions. Here is the general procedure:
1. Examine the hypothesis that the variances of the two distributions are the same. An F-test is used to test the difference in variances. To use this test, look at the probability value (listed following PROB > F' =) and check whether it is 0.05 or smaller.
a. If the PROB > F' is 0.05 or smaller, you can assume that the variances are "unequal." Technically, you are concluding that, if the variances were really the same, a difference as large as the one you have measured would occur by chance alone only 5% of the time (or less, if a smaller probability is listed).
b. If the PROB > F' is larger than 0.05, you can conclude that the observed difference is not large enough to call the variances "unequal." You can treat the variances as "equal."
2. Examine the hypothesis that the means of the two distributions are the same. A t-test is used to test the difference in means. There are two t-values and associated probabilities given in PROC TTEST. Choose the one that corresponds to the result of the F-test that you have just performed (Part 1 above).
a. If the PROB > |T| is 0.05 or smaller, it indicates a significant difference between the means of the two distributions. Here, too, you are technically concluding that differences this great between two means will occur by chance alone only 5% of the time (or less if you have a smaller probability value).
b. If the PROB > |T| is larger than 0.05, you can conclude that the observed differences are not large enough for the two distributions to be different. You can assume that the means are the "same."
3. You should draw your conclusions about the similarity of the two distributions using evidence from both the F- and t-tests.
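The values read in this procedure come from a call along the following lines. This is a minimal sketch that assumes the data are in a data set named FISH (a hypothetical name). CLASS names the categorization variable and VAR names the measurement variable.

```sas
/* FISH is a hypothetical data set name */
PROC TTEST DATA=FISH;
   CLASS BAIT;     /* categorization variable: must have exactly two values */
   VAR WEIGHT;     /* measurement variable whose means are compared */
RUN;
```

The output from a call like this includes the F' test of the variances (step 1) and the two t-values with their probabilities (step 2), one computed for equal variances and one for unequal variances.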
These decisions can be shown with the TTEST output for the three example cases.