15 - Measurement and Statistical Methods

Most of the experiments described in this text involve measuring one or more variables and then analyzing the data statistically. The design and scoring of all the tests we have discussed are also based on statistical methods. Statistics is a branch of mathematics that provides techniques for sorting out quantitative facts and ways of drawing conclusions from them. Statistics let us organize and describe data quickly, guide the conclusions we draw, and help us make inferences.

Statistical analysis is essential to conducting an experiment or designing a test, but statistics can only handle numbers - groups of them. To use statistics, the psychologist first must measure things - count and express them in quantities.

SCALES OF MEASUREMENT

No matter what we are measuring - height, noise, intelligence, attitudes - we have to use a scale. The data we want to collect determine the scale we will use and, in turn, the scale we use helps determine the conclusions we can draw from our data.

Nominal Scales

A nominal scale is a set of arbitrarily named or numbered categories. If we decide to classify a group of people by the color of their eyes, we are using a nominal scale. We can count how many people have blue, green, or brown eyes, and so on, but we cannot say that one group has more or less eye color than another; the colors are simply different. Since a nominal scale is more of a way of classifying than of measuring, it is the least informative kind of scale. If we want to compare our data more precisely, we will have to use a scale that tells us more.

Ordinal Scales

If we list horses in the order in which they finish a race, we are using an ordinal scale. On an ordinal scale, data are ranked from first to last according to some criterion. An ordinal scale tells the order, but nothing about the distances between what is ranked first and second or ninth and tenth. It does not tell us how much faster the winning horse ran than the horses that placed or showed. If a person ranks her preferences for various kinds of soup - pea soup first, then tomato, then onion, and so on - we know which soup she likes most and which soup she likes least, but we have no idea how much better she likes tomato than onion, or if pea soup is far more favored than either one of them.

Since we do not know the distances between the items ranked on an ordinal scale, we cannot add or subtract ordinal data. If mathematical operations are necessary, we need a still more informative scale.

Interval Scales

An interval scale is often compared to a ruler that has been broken off at the bottom - it only goes from, say, 5 1/2 to 12. The intervals between 6 and 7, 7 and 8, 8 and 9, and so forth are equal, but there is no true zero. A Fahrenheit or centigrade thermometer is an interval scale - even though a certain degree registered on such a thermometer specifies a certain state of cold or heat, there is no such thing as the absence of temperature. One day is never twice as hot as another; it is only so many equal degrees hotter.

An interval scale tells us how many equal-size units one thing lies above or below another thing of the same kind, but it does not tell us how many times bigger, smaller, taller, or fatter one thing is than another. An intelligence test cannot tell us that one person is three times as intelligent as another, only that he or she scored so many points above or below someone else.

Ratio Scales

We can only say that one measurement is two times as long as another or three times as high when we use a ratio scale, one that has a true zero. For instance, if we measure the snowfall in a certain area over several winters, we can say that six times as much snow fell during the winter in which we measured a total of 12 feet as during a winter in which only 2 feet fell. This scale has a true zero - it is possible for no snow at all to fall.

MEASUREMENTS OF CENTRAL TENDENCY

Usually, when we measure a number of instances of anything - from the popularity of TV shows to the weights of 8-year-old boys to the number of times a person's optic nerve fires in response to electrical stimulation - we get a distribution of measurements that range from smallest to largest or lowest to highest. The measurements will usually cluster around some value near the middle. This value is the central tendency of the distribution of the measurements.

Suppose, for example, you want to keep 10 children busy tossing rings around a bottle. You give them three rings to toss each turn, the game has six rounds, and each player scores one point every time he or she gets the ring around the neck of the bottle. The highest possible score is 18. The distribution of scores might end up like this: 11, 8, 13, 6, 12, 10, 16, 9, 12, 3.

What could you quickly say about the ring-tossing talent of the group? First, you could arrange the scores from lowest to highest: 3, 6, 8, 9, 10, 11, 12, 12, 13, and 16. In this order, the central tendency of the distribution of scores becomes clear. Many of the scores cluster around the values between 8 and 12. There are three ways to describe the central tendency of a distribution. We usually refer to all three as the average.

The arithmetical average is called the mean - the sum of all the scores in the group divided by the number of scores. If you add up all the scores and divide by 10, the total number of scores in this group of ring tossers, you find that the mean for the group is 10.

The median is the point that divides a distribution in half - 50% of the scores fall above the median, and 50% fall below. In the ring-tossing scores, five scores fall at 10 or below, five at 11 or above. The median is thus halfway between 10 and 11, which is 10.5.

The point at which the largest number of scores occurs is called the mode. In our example, the mode is 12: more children scored 12 than any other number.
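For readers who want to see the arithmetic carried out, here is a minimal sketch in Python (the statistics module is part of the standard library) that computes all three averages for the ring-toss scores:

# The three measures of central tendency for the ring-toss scores.
import statistics

scores = [11, 8, 13, 6, 12, 10, 16, 9, 12, 3]

mean = sum(scores) / len(scores)        # arithmetical average: 10.0
median = statistics.median(scores)      # midpoint of the sorted scores: 10.5
mode = statistics.mode(scores)          # most frequent score: 12

print(mean, median, mode)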

Differences Among the Mean, Median, and Mode

If we take enough measurements of almost anything, we are likely to get a distribution of scores in which the mean, median, and mode are all about the same - the score that occurs most often (the mode) will also be the point below which half the scores fall and above which half fall (the median), and the same point will be the arithmetical average (the mean). This is not always true, however, and small samples rarely come out so symmetrically. In these cases, we often have to decide which of the three measures of central tendency - the mean, the median, or the mode - will tell us what we want to know.

For example, a shopkeeper wants to know the general incomes of passersby so he can stock the right merchandise. He might conduct a rough survey by standing outside his store for a few days from 12:00 to 2:00 and asking every tenth person who walks by to check a card showing the general range of his or her income. Suppose most of the people checked the ranges between $25,000 and $60,000 a year. However, a couple of the people made a lot of money - one checked $100,000-$150,000 and the other checked the $250,000-or-above box. The mean for the set of income figures would be pushed higher by those two large figures and would not really tell the shopkeeper what he wants to know about his potential customers. In this case, he would be wiser to use the median or the mode.

Suppose that instead of two people with very high incomes, he noticed that people from two distinct income groups walked by his store - several people checked the box for $25,000-$35,000, and several others checked $50,000-$60,000. The shopkeeper would find that his distribution was bimodal. It has two modes - $30,000 and $55,000. This might be more useful to him than the mean, which could lead him to think his customers were a single group with an average income of about $40,000.
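To make the difference concrete, here is a brief Python sketch using made-up income figures (the survey above reports only ranges, so the individual numbers are invented for illustration):

# Hypothetical income checks, in thousands of dollars, clustered in two groups.
import statistics

incomes = [28, 30, 30, 32, 34, 52, 55, 55, 57, 60]

print(statistics.multimode(incomes))   # [30, 55] - the two modes stand out
print(statistics.mean(incomes))        # 43.3 - an "average" that describes almost nobody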

Another way of approaching a set of scores is to arrange them into a frequency distribution - that is, to select a set of intervals and count how many scores fall into each interval. A frequency distribution is useful for large groups of numbers; it puts the number of individual scores into more manageable groups.

Suppose a psychologist tests memory. She asks 50 college students to learn 18 nonsense syllables, then records how many syllables each student can recall two hours later. She arranges her raw scores from lowest to highest in a rank distribution: 2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13, 13, 14, 14, 15, 16, 17

The scores range from 2 to 17, but 50 individual scores are too cumbersome to work with. So she chooses a set of two-point intervals (1-2, 3-4, 5-6, 7-8, 9-10, 11-12, 13-14, 15-16, 17-18) and tallies the number of scores in each interval. Now she can tell at a glance what the results of her experiment were: most of the students had scores near the middle of the range, and very few had scores in the high or low intervals. She can see these results even better if she uses the frequency distribution to construct a bar graph - a frequency histogram - marking the intervals along the horizontal axis and the number of scores in each interval along the vertical axis. Another way is to construct a frequency polygon, a line graph. Such a figure is not a smooth curve, since the points are connected by straight lines. With many scores, however, and with small intervals, the angles would smooth out, and the figure would resemble a rounded curve.
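As a quick illustration of how such a tally might be carried out, here is a short Python sketch that groups the 50 recall scores into the intervals above and prints a crude text histogram (one asterisk per score):

# Tally the recall scores into two-point intervals and print a rough histogram.
scores = [2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8,
          9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11,
          12, 12, 12, 13, 13, 13, 13, 13, 14, 14, 15, 16, 17]

for low in range(1, 18, 2):                      # intervals 1-2, 3-4, ..., 17-18
    high = low + 1
    count = sum(low <= s <= high for s in scores)
    print(f"{low:2d}-{high:2d}: {'*' * count} ({count})")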

THE NORMAL CURVE

Ordinarily, if we take enough measurements of almost anything, we get a normal distribution. Tossing coins is a favorite example of statisticians. If you tossed 10 coins into the air 1,000 times and recorded the heads and tails on each toss, your tabulations would reveal a normal distribution. Five heads and five tails would be the most frequent, followed by four heads/six tails and six heads/four tails, and so on down to the rare all heads or all tails.
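A short simulation makes the point; the Python sketch below tosses 10 coins 1,000 times and tallies how many heads come up on each toss:

# Simulate 1,000 tosses of 10 coins; the counts pile up around 5 heads.
import random
from collections import Counter

tosses = Counter(sum(random.random() < 0.5 for _ in range(10))
                 for _ in range(1000))

for heads in range(11):
    bar = "*" * (tosses[heads] // 5)             # one asterisk per 5 tosses
    print(f"{heads:2d} heads: {tosses[heads]:4d} {bar}")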

Plotting a normal distribution on a graph yields a particular kind of frequency polygon called a normal curve. Consider, for example, data on the heights of 1,000 men. Over the bars of a histogram reflecting the actual data, you could superimpose an "ideal" normal curve for the same data. Note that the curve is absolutely symmetrical - the left slope parallels the right slope exactly. Moreover, the mean, median, and mode all fall on the highest point of the curve.

The normal curve is a hypothetical entity. No set of real measurements shows such a smooth gradation from one interval to the next, or so purely symmetrical a shape. But because so many things do approximate the normal curve so closely, the curve is a useful model for much that we measure.

Skewed Distributions

If a frequency distribution is asymmetrical - if most of the scores are gathered at either the high end or the low end - the frequency polygon will be skewed. The hump will sit to one side or the other, and one of the curve's tails will be disproportionately long.

If a high school mathematics instructor, for example, gives her students a sixth-grade arithmetic test, we would expect nearly all the scores to be quite high, producing a curve skewed to the left (its long tail trailing off toward the few low scores). But if a sixth-grade class were asked to do advanced algebra, the scores would probably be quite low, and the curve would be skewed to the right.

Note, too, that the mean, median, and mode fall at different points in a skewed distribution, unlike in the normal curve, where they coincide. Usually, if you know that the mean is greater than the median of a distribution, you can predict that the frequency polygon will be skewed to the right. If the median is greater than the mean, the curve will be skewed to the left.
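The rule is easy to check with a small Python sketch; the test scores below are invented to mimic the easy-arithmetic-test example above, with most scores high and a few stragglers stretching the left-hand tail:

# When the median exceeds the mean, the long tail points to the left.
import statistics

test_scores = [58, 60, 92, 94, 95, 96, 97, 98, 99, 100]

print(statistics.mean(test_scores))     # 88.9 - dragged down by the two low scores
print(statistics.median(test_scores))   # 95.5 - greater than the mean, so the
                                        # distribution is skewed to the left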

Bimodal Distributions

We have already mentioned a bimodal distribution in our description of the shopkeeper's survey of his customers' incomes. The frequency polygon for a bimodal distribution has two humps - one for each mode. The mean and the median may be the same or different.

MEASURES OF VARIATION

Sometimes it is not enough to know the distribution of a set of data and what their mean, median, and mode are. Suppose an automotive safety expert feels that too much damage occurs in rear-end accidents because automobile bumpers are not all the same height. It is not enough to know what the average height of an automobile bumper is. The safety expert also wants to know about the variation in bumper heights: How much higher is the highest bumper than the mean? How much do the bumpers of all cars vary from the mean? Do most bumpers sit close to the mean height, or are they spread out widely?

Range

The simplest measure of variation is the range - the difference between the largest and smallest measurements. Perhaps the safety expert measured the bumpers of 1,000 cars 2 years ago and found that the highest bumper was 18 inches from the ground, the lowest only 12 inches from the ground. The range was thus 6 inches - 18 minus 12. This year the highest bumper is still 18 inches high, the lowest still 12 inches from the ground. The range is still 6 inches. Moreover, our safety expert finds that the means of the two distributions are the same - 15 inches off the ground. There is still something the expert needs to know, since the measurements cluster around the mean in drastically different ways. To find out how the measurements are distributed around the mean, our safety expert has to turn to a slightly more complicated measure of variation - the standard deviation.

The Standard Deviation

The standard deviation, in a single number, tells us how the scores in a frequency distribution are dispersed around the mean. It is one of the most useful and widely employed statistical tools.

To find the standard deviation of a set of scores, we first find the mean. Then we take the first score in the distribution, subtract it from the mean, square the difference, and jot it down in a column to be added up later. We do the same for all the scores in the distribution. Then we add up the column of squared differences, divide the total by the number of scores in the distribution, and find the square root of that number.
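Here is that procedure written out as a short Python function, applied to the ring-toss scores from earlier (whose mean is 10):

# Standard deviation, computed exactly as described above: square each score's
# difference from the mean, average the squared differences, take the square root.
import math

def standard_deviation(scores):
    mean = sum(scores) / len(scores)
    squared_diffs = [(score - mean) ** 2 for score in scores]
    return math.sqrt(sum(squared_diffs) / len(scores))

print(standard_deviation([11, 8, 13, 6, 12, 10, 16, 9, 12, 3]))   # about 3.5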

In a normal distribution, however peaked or flattened the curve, about 68% of the scores fall between one standard deviation above the mean and one standard deviation below the mean. Another 27% fall between one standard deviation and two standard deviations on either side of the mean, and 4% more between the second and third standard deviations on either side. Overall, then, more than 99% of the scores fall between three standard deviations above and three standard deviations below the mean. This makes the standard deviation useful for comparing two different normal distributions.

Now let us see what the standard deviation can tell our automotive safety expert about the variations from the mean in the two sets of data. The standard deviation for the cars measured 2 years ago is about 1.4. A car with a bumper height of 16.4 is one standard deviation above the mean of 15; one with a bumper height of 13.6 is one standard deviation below the mean. Since the safety expert knows that the data fall into a normal distribution, he can figure that about 68% of the 1,000 cars he measured will fall somewhere between these two heights: 680 cars will have bumpers between 13.6 and 16.4 inches high. For the more recent set of data, the standard deviation is just slightly less than 1. A car with a bumper height of about 14 inches is one standard deviation below the mean; a car with a bumper height of about 16 is one standard deviation above the mean. Thus, in this distribution, 680 cars have bumpers between 14 and 16 inches high. This tells the safety expert that car bumpers are becoming more uniform in height, although the range of heights is still the same (6 inches), and the mean height of bumpers is still 15 inches.

MEASURES OF CORRELATION

Measures of central tendency and measures of variation can be used to describe a single set of measurements - like the children's ring-tossing scores - or to compare two or more sets of measurements - like the two sets of bumper heights. Sometimes, however, we need to know whether two sets of measurements are in any way associated with each other - whether they are correlated. Is parental IQ related to children's IQ? Does the need for achievement relate to the need for power? Is watching violence on TV related to aggressive behavior?

One fast way to determine whether two variables are correlated is to draw a scatter plot. We assign one variable (X) to the horizontal axis of a graph and the other variable (Y) to the vertical axis. Then we plot a person's score on one characteristic along the horizontal axis and his or her score on the second characteristic along the vertical axis. Where the two scores intersect, we draw a dot. When several scores have been plotted in this way, the pattern of dots tells whether the two characteristics are in any way correlated with each other.

If the dots on a scatter plot form a straight line running between the lower-left-hand corner and the upper-right-hand corner, we have a perfect positive correlation - a high score on one of the characteristics is always associated with a high score on the other one. A straight line running between the upper-left-hand corner and the lower-right-hand corner is the sign of a perfect negative correlation - a high score on one of the characteristics is always associated with a low score on the other one. If the pattern formed by the dots is cigar shaped in either of these directions, we have a modest correlation - the two characteristics are related but not highly correlated. If the dots spread out over the whole graph, forming a circle or a random pattern, there is no correlation between the two characteristics.

A scatter plot can give us a general idea of whether a correlation exists and how strong it is. To describe the relation between two variables more precisely, we need a correlation coefficient - a statistical measure of the degree to which two variables are associated. The correlation coefficient tells us the degree of association between two sets of matched scores - that is, to what extent high or low scores on one variable tend to be associated with high or low scores on another variable. It also provides an estimate of how well we can predict from a person's score on one characteristic how high he or she will score on another characteristic. If we know, for example, that a test of mechanical ability is highly correlated with success in engineering courses, we could predict that success on the test would also mean success as an engineering major.

Correlation coefficients can run from +1.0 to -1.0. The highest possible value (+1.0) indicates a perfect positive correlation - high scores on one variable are always and systematically related to high scores on a second variable. The lowest possible value (-1.0) means a perfect negative correlation - high scores on one variable are always and regularly related to low scores on the second variable. In life, most things are far from perfect, so most correlation coefficients fall somewhere between +1.0 and -1.0. A correlation smaller than +/-.20 is considered very low, from +/-.20 to +/-.40 is low, from +/-.40 to +/-.60 is moderate, from +/-.60 to +/-.80 is high, and from +/-.80 to +/-1.0 is very high. A correlation of zero indicates that there is no correlation between two sets of scores - no regular relation between them at all.
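For the curious, here is a small Python sketch of how a correlation coefficient (Pearson's r, the most common kind) can be computed; the mechanical-ability scores and course grades below are made up for illustration:

# Pearson correlation coefficient for two sets of matched scores.
import math

def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

mechanical_ability = [52, 60, 63, 71, 75, 80, 88]          # test scores
engineering_grades = [2.1, 2.4, 2.9, 3.0, 3.2, 3.6, 3.8]   # course grades
print(correlation(mechanical_ability, engineering_grades))  # about +.98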

Correlation tells us nothing about causality. If we found a high positive correlation between participation in elections and income levels, for example, we still could not say that being wealthy made people vote or that voting made people wealthy. We would still not know which came first, or whether some third variable explained both income levels and voting behavior. Correlation only tells us that we have found some association between scores on two specified characteristics.

USING STATISTICS TO MAKE PREDICTIONS

Behind the use of statistics is the hope that we can generalize from our results and use them to predict behavior. We hope, for example, that we can use the record of how well a group of rats runs through a maze today to predict how another group of rats will do tomorrow, that we can use a person's scores on a sales aptitude test to predict how well he or she will sell life insurance, or that we can measure the attitudes of a relatively small group of people about pollution control to indicate what the attitudes of the whole country are.

First, we have to determine whether our measurements are representative and whether we can have confidence in them. This requires proper sampling.

Probability

Errors based on inadequate sampling procedures are somebody's fault. Other kinds of errors occur randomly. In the simplest kind of experiment, a psychologist will gather a representative sample, split it randomly into two groups, and then apply some experimental manipulation to one of the groups. Afterward, the psychologist will measure both groups and determine whether the experimental group's score is now different from the score of the control group. But even if there is a large difference between the scores of the two groups, it may still be wrong to attribute the difference to the manipulation. Random effects might influence the results and introduce error.

Statistics give the psychologist many ways to determine precisely whether the difference between the groups is really significant, whether something other than chance produced the results, and whether the same results would be obtained with different subjects. These probabilities are expressed as measures of significance. If the psychologist computes the significance level for the results as .05, he or she knows that there are 19 chances out of 20 that the results are not due to chance. But there is still 1 chance in 20 - or a .05 likelihood - that the results are due to chance. A .01 significance level would mean that there is only 1 chance in 100 that the results are due to chance.
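One simple way to estimate such a probability is a permutation test, sketched below in Python with made-up scores: if the manipulation had no real effect, then shuffling the group labels at random should produce a difference as large as the observed one fairly often.

# Permutation test: how often does chance alone produce a difference
# at least as large as the one we actually observed?
import random

experimental = [14, 16, 15, 18, 17, 15, 19, 16]   # invented scores
control      = [13, 14, 12, 15, 14, 13, 16, 12]

observed = sum(experimental) / len(experimental) - sum(control) / len(control)

pooled = experimental + control
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    fake_exp, fake_ctl = pooled[:8], pooled[8:]
    if sum(fake_exp) / 8 - sum(fake_ctl) / 8 >= observed:
        extreme += 1

print(extreme / trials)   # below .05 would correspond to the .05 significance level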

USING META-ANALYSIS IN PSYCHOLOGICAL RESEARCH

In previous chapters, we have presented findings from reviews of psychological research in which a research team has summarized a wide selection of literature on a topic in order to reach some conclusions on that topic. There are several crucial decisions to be made in such a process: Which research reports should be included? How should the information be summarized? What questions might be answered after all the available information is gathered?

Traditionally, psychologists reviewing the literature in a particular area relied on the box-score method to reach conclusions. That is, after collecting all the relevant research reports, the researcher simply counted the number supporting one conclusion or the other, much like keeping track of the scoring in nine innings of a baseball game (hence, the term box score). For example, if there were 200 studies on gender differences in aggressive behavior, researchers might find that 120 of them showed that males were more aggressive than females, 40 showed the opposite pattern, and 40 showed no evidence of gender differences. On the basis of these box scores, the reviewer might conclude that males are more likely than females to act aggressively.

Today researchers tend to rely on a more sophisticated strategy known as meta-analysis. Meta-analysis provides a way of statistically combining the results of individual research studies to reach an overall conclusion. In a single experiment, each participant contributes data to help the researcher reach a conclusion. In a meta-analysis, each published study contributes data to help the reviewer reach a conclusion. Rather than relying on the raw data of individual participants, meta-analysis treats the results of entire studies as its raw data. Meta-analysts begin by collecting all available research reports that are relevant to the question at hand. Next, they statistically transform these results into a common scale for comparison. That way differences in sample size (one study might have used 50 participants, another 500), in the magnitude of an effect (one study might have found a small difference, another a more substantial one), and in experimental procedures (which might vary from study to study) can be examined using the same methods.
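The statistical combination itself can take many forms; the Python sketch below shows one of the simplest, a fixed-effect average in which each study's effect size (the figures are invented) is weighted by the inverse of its variance, so that larger, more precise studies count for more:

# Combine study-level effect sizes into one overall estimate,
# weighting each study by the inverse of its variance.
studies = [
    # (effect size, variance) - figures invented for illustration
    (0.40, 0.08),   # small study, less precise
    (0.25, 0.02),   # large study, more precise
    (0.55, 0.05),
]

weights = [1 / variance for _, variance in studies]
combined = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)

print(round(combined, 2))   # overall effect size, about 0.35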

The key element in this process is its statistical basis. Rather than keeping tally of "yeas" and "nays," meta-analysis allows the reviewer to determine both the strength and the consistency of a research conclusion. For example, instead of simply concluding that there were more studies that found a particular gender difference, the reviewer might determine that the genders differ by 6/10 of a percentage point, or that across all the studies the findings are highly variable.

Meta-analysis has proved to be a valuable tool for psychologists interested in reaching conclusions about a particular research topic. By systematically examining patterns of evidence across individual studies whose conclusions vary, psychologists are able to gain a clearer understanding of the findings and their implications.