One convenient way to organize numerical data is a dot plot. A dot plot is a simple display that places a dot (or an X, or another symbol) above an axis for each datum (datum is the singular of data). The axis should cover the entire range of the data. Numbers that have no data marked above them should still be included, so that outliers and gaps are visible. There is one dot for each value, so values that occur more than once are shown by stacked dots. Dot plots are especially useful when you are working with a small set of data across a reasonably small range of values. This type of graph gives a clear view of the shape, any mode(s), and the range of a set of data. Because the values are already in order along the axis, finding the median is fairly quick, and any outliers are easy to spot.
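The construction described above can be sketched in a few lines of Python. This is only a minimal text-based sketch: the function name dot_plot and the quiz scores are made up for illustration, and it assumes the data are whole numbers.

```python
from collections import Counter


def dot_plot(values):
    """Return a text dot plot: one X per datum, stacked above an integer axis."""
    counts = Counter(values)
    # Cover the whole range, including values with no data, so gaps stay visible.
    axis = range(min(values), max(values) + 1)
    width = max(len(str(v)) for v in axis) + 1   # column width that fits every label
    rows = []
    # Build the stacks from the top level down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("X" if counts[v] >= level else " ").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)


# Hypothetical quiz scores, made up for illustration only.
scores = [7, 8, 8, 9, 9, 9, 10, 4]
print(dot_plot(scores))
```

In the printed plot, the score of 4 sits by itself, separated from the other scores by the empty columns for 5 and 6. This is the kind of gap or possible outlier that the empty axis values are meant to reveal.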
Once you have constructed a graphical representation of a data set, the next step is to describe what the graph shows. There are several characteristics that should be mentioned when describing a numerical distribution, and your description needs to explain what this specific data represents. Describe the shape of the graph, whether or not there are any outliers present in the data, the location of the center of the data, and how spread out the data is. All of this should be done in the specific context of the individuals and variable being studied. We will use the acronym S.O.C.C.S. to help you remember what to include in your descriptions: shape, outliers, context, center, and spread. An explanation of each of these characteristics follows.
Shape
Once a graphical display is constructed, we can describe the distribution. When describing the distribution, we should be sure to address its shape. Although many graphs will not have a clear or exact shape, we can usually identify the shape as symmetrical or skewed. A symmetrical distribution has a middle through which we can draw an imaginary line, with a fairly equal "look" on either side of that line. If you were to fold along the imaginary center line, the two sides would almost match up. Many symmetrical distributions are bell shaped: they are tall in the middle, with the two sides thinning out. The sides are referred to as tails. A skewed distribution is one in which the bulk of the data is concentrated on one end, with the other side forming a longer tail. The direction of the longer tail is the direction of the skew. A distribution that is skewed right has a longer tail to the right, toward the higher values. A distribution that is skewed left has a longer tail to the left, toward the lower values. Other shapes that you might see are uniform (nearly the same height all the way across) and bimodal (having two peaks in the distribution).
Outliers
If there are any outliers, gaps, groupings, or other unusual features in the distribution, we should be sure to mention them. An outlier is a value that does not fit with the rest of the data. Some distributions will have several outliers, while others will not have any. We should always look for outliers because they can affect many of our statistics. Also, sometimes an outlier is actually an error that needs to be corrected. If you have ever 'bombed' one test in a class, you probably discovered that it had a big impact on your overall average in that class. This is because the mean is affected by an outlier; it is pulled toward the outlier. This is another reason why we should be sure to look at the data, not just at the statistics about the data. When an outlier is part of the data and we do not realize it, the mean can mislead us into believing that the numbers are higher or lower than they really are.
Context
Do not forget that the graph, the numbers, and the descriptions are all about something: the context. All of these elements of the distribution should be described in the specific context of the situation in question.
Center
The center of the distribution should always be included in the verbal analysis as well. People often wonder what the 'average' is. The measure of center can be reported as the median, the mean, or the mode. Even better, give more than one of these in your description. Remember that outliers pull the mean toward them but have little or no effect on the median. For example, the median of a list of data will stay in the center even when the largest value increases tremendously, but such a change would affect the mean quite a bit.
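The last point can be checked with a short calculation. The sketch below uses Python's standard statistics module on a small list that is made up purely for illustration. It replaces the largest value with a much larger one and compares the mean and median before and after.

```python
from statistics import mean, median

# Hypothetical data set, made up for illustration only.
data = [2, 4, 5, 7, 9]
stretched = data[:-1] + [900]   # same list, but the largest value grows enormously

print("original :", "mean =", mean(data), "median =", median(data))
print("stretched:", "mean =", mean(stretched), "median =", median(stretched))
```

The median is 5 for both lists, while the mean jumps from 5.4 to 183.6. Reporting both measures of center makes this kind of distortion easier to notice.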
Spread
Another thing to include in the description is the spread of the data. Here, the spread is described by the specific range of the data. When analyzing a distribution, we don't want to simply say that the range is equal to some number. It is much more informative to say that the data ranges from ______ to ______ (minimum value to maximum value). For example, if the news reports that the temperature in St. Paul had a range of 20° during a given week, this could mean very different temperatures depending on the time of year. It would be more informative to say something specific, such as: the temperature in St. Paul ranged from 68° to 88° last week.
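To see why the single range number is ambiguous, the sketch below compares two weeks of temperatures that share the same range. The helper name describe_spread and the daily values are assumptions made for illustration. Only the 68 and 88 endpoints come from the St. Paul example.

```python
def describe_spread(values, unit="°"):
    """Report the spread as 'minimum to maximum', plus the single range number."""
    low, high = min(values), max(values)
    return f"ranged from {low}{unit} to {high}{unit} (a range of {high - low}{unit})"


# Hypothetical daily highs; only the 68 and 88 endpoints come from the text.
summer_week = [68, 74, 80, 88, 85, 79, 71]
# Assumed winter values chosen to have the same 20-degree range.
winter_week = [-5, 0, 3, 15, 9, 4, -2]

print("Summer week:", describe_spread(summer_week))
print("Winter week:", describe_spread(winter_week))
```

Both weeks have a range of 20°, yet one runs from 68° to 88° and the other from -5° to 15°. Stating the minimum and maximum keeps that difference visible.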
S.O.C.C.S.
So, when you describe the distribution of a numerical variable, there are several things to include. This text will use the acronym S.O.C.C.S. (shape, outliers, context, center, spread) to help us remember what characteristics to include in our descriptions.
Example 1
An anthropology instructor at the community college is interested in analyzing the age distribution of her students. The students in her Anthropology 102 class are: 21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26, 28, 24, 22, 24, 19, 23, 24, 24, 21, 23, and 28 years old. Organize the data in a dot plot. Calculate the mean, median, mode, and range for the distribution. Describe the distribution. Be sure to include the shape, outliers, center, context, and spread.
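One way to check the calculations for this example is with Python's standard statistics module. This is a sketch, not the only approach. The ages are copied from the problem statement, and multimode (which needs Python 3.8 or later) is used so that more than one mode would be reported if there were a tie.

```python
from collections import Counter
from statistics import mean, median, multimode

# Ages of the 22 students, copied from the problem statement.
ages = [21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26,
        28, 24, 22, 24, 19, 23, 24, 24, 21, 23, 28]

summary = {
    "n": len(ages),
    "mean": round(mean(ages), 2),
    "median": median(ages),
    "modes": multimode(ages),
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# The frequency of each age is the height of its stack in the dot plot.
for age, count in sorted(Counter(ages).items()):
    print(f"{age:>3} {'X' * count}")
```

Running this gives a mean of about 23.27 years, a median of 24, and a single mode of 24. The ages range from 18 to 28 years, a range of 10. The frequency listing at the end is a sideways view of the dot plot, which can help when describing the shape and checking for outliers.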
In statistics, data is represented in tables, charts or graphs. One disadvantage of representing data in these ways is that the specific data values are often not retained. Using a stem plot is one way to ensure that the data values are kept intact. A stem plot is a method of organizing the data that includes sorting the data and graphing it at the same time. This type of graph uses the stem as the leading part of the data value and the leaf as the remaining part of the value. The result is a graph that displays the sorted data in groups or classes. A stem plot is used with numerical data when it will be helpful to see the actual values organized in order.
To construct a stem plot, you must first determine the range of your distribution. Build the stems so that they cover the entire range, and include every stem even if it will have no values after it. This will allow us to see the true shape of the distribution, including any outliers, skew, and gaps. Then place all of the "leaves" after the appropriate stems. Place the leaves in ascending order, moving outward from the stem, and include all values, so repeats will show more than once. Some people like to put the numbers in order before they construct the stem plot, some like to put them in order as they make the plot, and others like to make a rough draft first without regard to order and then make a final copy with the numbers in the correct order. Any of these methods will result in a correct stem plot.
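The steps above can be sketched in Python. This is a minimal sketch under some assumptions: the function name stem_plot is made up, the data are non-negative whole numbers, the ones digit is the leaf, and everything to its left is the stem. It is applied here to the ages from Example 1.

```python
from collections import defaultdict


def stem_plot(values):
    """Return a text stem plot for non-negative integers: stem | ones digits."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting first keeps each row in order
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem across the range, even stems that have no leaves.
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>2} | {row}".rstrip())
    return "\n".join(lines)


# Ages from Example 1.
ages = [21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26,
        28, 24, 22, 24, 19, 23, 24, 24, 21, 23, 28]
print(stem_plot(ages))
```

Sorting before building the rows corresponds to the first of the three approaches described above, in which the numbers are put in order before the plot is made. Repeated ages appear as repeated leaves, so every original value can be read back from the plot.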
Example 2
A researcher was studying the growth of a certain plant. She planted 25 seeds and kept watering, sunlight, and temperature as consistent as possible. The following numbers represent the growth (in centimeters) of the plants after 28 days.
a) Construct a stem plot
b) Describe the distribution.
Example 3
Sometimes a stem plot ends up looking too crowded. When the data is concentrated in a few rows, or 'classes', it can be difficult to determine what the shape is or whether there are any outliers in the data. In this example, the stem plot for the ages of a group of people was really concentrated in the 30s and 40s (plot on left). However, the statistician looking at this was not satisfied with the crowded appearance, so she decided to 'split' the stems. The resulting graph on the right, called a split-stem plot, shows very different results. Describe the distribution based on the split-stem plot.
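Splitting stems can also be sketched in code. The function below is a hypothetical helper, not the statistician's actual method. It assumes non-negative whole numbers and gives each stem two rows: one for leaves 0 through 4 and one for leaves 5 through 9. Because the ages plotted in this example are not listed in the text, the sketch reuses the ages from Example 1 instead.

```python
def split_stem_plot(values):
    """Text stem plot with each stem split in two: leaves 0-4, then leaves 5-9."""
    def row_key(v):
        # (stem, half): half is 0 for leaves 0-4 and 1 for leaves 5-9.
        return (v // 10, 0 if v % 10 < 5 else 1)

    rows = {}
    for v in sorted(values):
        rows.setdefault(row_key(v), []).append(str(v % 10))

    first, last = row_key(min(values)), row_key(max(values))
    lines = []
    stem, half = first
    while (stem, half) <= last:           # walk every half-stem, even empty ones
        leaves = " ".join(rows.get((stem, half), []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return "\n".join(lines)


# Ages from Example 1, not the ages shown in the Example 3 plots.
ages = [21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26,
        28, 24, 22, 24, 19, 23, 24, 24, 21, 23, 28]
print(split_stem_plot(ages))
```

With these ages, the single crowded row for the 20s becomes two rows, one for ages 21 through 24 and one for ages 25 through 28. This makes it easier to see where the values are concentrated.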