Creating a scatterplot in Google Sheets
Scatterplots are graphs that represent a relationship between two variables. Two numerical values are measured about each individual being studied. When these two values become ordered pairs that are graphed on a coordinate plane, the resulting graph is called a scatterplot. We often suspect that one of these variables might explain, cause changes in, or help to predict the other variable. The explanatory variable is the variable that we believe may explain or affect the other variable. The explanatory variable is plotted along the x-axis. The response variable is the variable we believe may respond to, or be affected by the other variable. The response variable is plotted along the y-axis. The explanatory variable is often referred to as the independent variable and the response variable is referred to as the dependent variable. Even though we often look for an explanatory-response relationship between the two variables, we can create a scatterplot even if no such relationship exists.
Example 1
State whether or not you suspect that there will be an explanatory-response relationship between each of the following pairs of data. If yes, identify the explanatory and response variables.
a) A college professor decided to examine whether or not there is a relationship between the amount of time that a student studies and his or her score on the mid-term exam. At the end of the exam each student was asked to record the number of hours he or she had spent studying for the mid-term. The professor then made a scatterplot to examine the data.
b) A different professor wanted to see whether or not there is an association between her students’ heights and their IQ scores. She gave each of her students an IQ test and had her TA (teaching assistant) measure each student’s height to the nearest inch. She constructed a scatterplot to examine the data.
Solution
a) It is reasonable to believe that the amount of studying has an effect on students’ exam scores. The explanatory variable is hours spent studying and the response variable is exam score. Thinking in terms of a cause-and-effect relationship can often help identify which variable is which. As a hint, try to determine whether one of the variables comes first. If one comes first, then it is most likely the explanatory variable. In our example, studying comes before the exam.
b) It is not reasonable to believe that there is an association between height and IQ scores. Neither of these variables comes before the other and neither would be useful in predicting the other. However, even though we do not believe that there is an explanatory-response relationship between these variables, we can still construct a scatterplot.
Example 2
The following table reports the recycling rates for paper packaging and glass for several individual countries. It would be interesting to see if there is a predictable relationship between the percentages of each material that countries recycle. Construct a scatter plot to examine the relationship. Treat percentage of paper packaging recycled as the explanatory variable.
Solution
We will place the paper recycling rates on the horizontal axis because we are treating it as the explanatory variable. Glass recycling rates are then plotted along the vertical axis. Next, plot a point that shows each country's rate of recycling for the two materials. Be sure to label your axes.
Percent of Paper & Glass Recycled for 19 Countries [Figure4]
Notice that we do not always need to start at zero on either axis when making scatterplots.
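The plotting steps above can be sketched in Python. This is only a minimal illustration, not the method used to produce the figure: the rates below are made-up placeholder values (not the data for the 19 countries), the helper names `to_ordered_pairs`, `axis_limits`, and `plot_scatter` are hypothetical, and the drawing step assumes the third-party matplotlib library is installed.

```python
def to_ordered_pairs(explanatory, response):
    """Pair each individual's two measurements as (x, y)."""
    if len(explanatory) != len(response):
        raise ValueError("each individual needs both measurements")
    return list(zip(explanatory, response))


def axis_limits(values, padding=5):
    """Return (low, high) limits that hug the data instead of starting at zero."""
    return (min(values) - padding, max(values) + padding)


def plot_scatter(pairs, x_label, y_label, title):
    """Draw the scatterplot; assumes matplotlib is installed."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlim(*axis_limits(xs))  # axes need not start at zero
    ax.set_ylim(*axis_limits(ys))
    ax.set_xlabel(x_label)         # always label the axes
    ax.set_ylabel(y_label)
    ax.set_title(title)
    plt.show()


# Hypothetical recycling rates (percent) for five unnamed countries.
paper_rates = [62, 70, 75, 81, 88]  # explanatory variable -> x-axis
glass_rates = [55, 48, 79, 66, 90]  # response variable -> y-axis

pairs = to_ordered_pairs(paper_rates, glass_rates)
```

Calling `plot_scatter(pairs, "Paper recycled (%)", "Glass recycled (%)", "Paper vs. glass recycling")` would then draw the points with the explanatory variable on the horizontal axis.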
When we describe single-variable data, we address several characteristics. We used the acronym S.O.C.C.S. to help remember to describe the shape, outliers, center, and spread of a distribution, and to do all of this in the context of the variables and individuals being studied. For bivariate data, we will again discuss several characteristics in context. The important characteristics to describe when looking at the relationship between two numerical variables are strength, outliers, form, and direction, again in the context of the variables and individuals being compared. The acronym that will help us remember what to include in our descriptions is S.C.O.F.D. (strength, context, outliers, form, and direction).
When looking at a scatterplot, it is helpful to imagine drawing a line-of-best-fit through the data. A line-of-best-fit is a line that follows the trend of the data. It may go through some, all, or none of the actual points on the scatterplot. Do not actually draw such a line on your plot; just try to determine whether or not such a line would make sense and, if so, where it would fit. As you observe a scatterplot and imagine drawing such a line, you can ask yourself questions such as: How close to a line do the points lie? Would a curved pattern fit better? Are there points that would be far away from the line? Would the line have a positive or negative slope?
Once you have constructed a scatterplot, you can examine the strength of the relationship between the two variables. The strength refers to how closely the points form a pattern. The more closely the points fit a pattern, the stronger the relationship between the variables. The more spread out and scattered the points are, the weaker the relationship. The first plot below shows an extremely strong, linear pattern because the points form an obvious line. The second plot is more scattered, so it is only moderately strong. The third plot does not show much of a pattern at all, so it is moderately to very weak. Keep in mind that an association may be very strong but not linear. We could find a very clear curved pattern in the data, for example. In the next section we will learn about a statistic, called correlation, that measures the strength of the linear relationship between two variables.
[Figure5]
In example #2, the relationship between paper and glass recycling rates for these countries is very weak.
Do not forget that the graph, the numbers and equations, and the descriptions are all about something: its context. All of these elements should be described in the context of the variables and the individuals being examined. These graphs and statistics are not meaningless; they are about something!
In example #2, the scatterplot explores the relationship between glass and paper recycling rates for several countries.
When examining a scatterplot, look for any data values that do not fit the pattern, or points that stand out from the rest of the data. An outlier will be a point that lies away from the rest of the data or one that seems to affect the strength of the relationship between the two variables. Many outliers will weaken the association between the variables, but they often would not significantly change where a line-of-best-fit would be drawn. An influential point is an outlier that actually seems to influence the line-of-best-fit. Imagine what the plot would look like without the point in question. If it would change the strength, then the point is an outlier. If it would change the slope of a line-of-best-fit, or where the line would be drawn, then the point is influential.
In example #2, there seem to be some outliers. For example, Estonia and New Zealand have much lower paper recycling rates than their glass rates. Without these data values, the relationship would be stronger.
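The "imagine the plot without the point" test can be tried numerically. The sketch below assumes an ordinary least-squares line as the line-of-best-fit, which is one common choice and not something the text above specifies. The data and the helper names `least_squares_line` and `slope_with_and_without` are hypothetical, chosen only so that one point is clearly influential.

```python
def least_squares_line(pairs):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slope_with_and_without(pairs, index):
    """Compare the fitted slope with the point at `index` kept and removed."""
    slope_with, _ = least_squares_line(pairs)
    remaining = pairs[:index] + pairs[index + 1:]
    slope_without, _ = least_squares_line(remaining)
    return slope_with, slope_without


# Hypothetical data: four points on the line y = 2x, plus one point far from it.
points = [(1, 2), (2, 4), (3, 6), (4, 8), (10, 2)]

slope_with, slope_without = slope_with_and_without(points, index=4)
```

With these made-up points, removing the last point changes the fitted slope from negative to positive, so that point would be described as influential rather than merely an outlier.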
Many scatterplots show a clear form or pattern. The first plot below shows a clearly linear pattern or form. It is easy to imagine drawing a line-of-best-fit through these points. The second plot shows a clearly curved form. A line would not make any sense, so this is non-linear. The third plot shows a great deal of scatter among the points, so it has no form whatsoever.
[Figure8]
In example #2, the scatterplot for paper and glass recycling rates shows a very weak linear form. The relationship is very weak, but no curved pattern is visible. If the outliers were removed, it would become more linear.
The direction of the graph is also important to mention. A graph that goes down to the right has a negative association. That is, as the explanatory variable increases, the response variable decreases. The first plot below has a negative relationship between the variables. A graph that goes up to the right has a positive association. That is, as the explanatory variable increases, the response variable also increases. The second plot shows a positive relationship between the variables. The third plot is an example of a graph that has neither a positive nor a negative direction. If the relationship is linear and a line-of-best-fit is added to the graph, the slope of the line will be positive if the association is positive, and negative if the association is negative.
[Figure9]
In example #2, the scatterplot for paper and glass recycling rates shows a positive association. As the paper recycling rate for these countries increases, so does the glass recycling rate.
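One simple way to check direction numerically is to see whether the two variables tend to sit above or below their means together. The sketch below sums the products of each point's deviations from the two means; the sign of that sum matches the sign of a least-squares slope. The function name `direction`, the tolerance, and the data are hypothetical, and the check only makes sense when the pattern is roughly linear.

```python
def direction(pairs, tolerance=1e-9):
    """Classify the association as 'positive', 'negative', or 'neither'."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Positive terms: x and y are on the same side of their means.
    # Negative terms: x and y are on opposite sides of their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if co_movement > tolerance:
        return "positive"
    if co_movement < -tolerance:
        return "negative"
    return "neither"


# Hypothetical (weight in pounds, miles per gallon) pairs for five cars.
cars = [(2000, 38), (2500, 33), (3000, 28), (3500, 24), (4000, 19)]

car_direction = direction(cars)
```

For these made-up cars the result is a negative association: heavier cars pair with lower gas mileage.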
When you describe the relationship between two variables in bivariate data, there are several characteristics to include. The acronym S.C.O.F.D. will help you remember to describe the strength of the relationship, keep your description in context, mention any outliers, and describe the form and direction of the graph.
Example 3
The following example is a scatterplot showing the weights (in pounds) and gas mileage (miles per gallon) for several cars.
a) Identify the explanatory and response variables.
b) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.).
Solution
a) The explanatory variable is the weight of the cars in pounds.
The response variable is the gas mileage of the cars (mpg).
b) The relationship between these vehicles' weights in pounds and gas mileage (mpg) is strong and very linear. There are no extreme outliers visible in the graph. The association between a vehicle's weight and gas mileage is negative. As the weight of the vehicles increases, the gas mileage of the vehicles decreases.
Example 4
The following scatterplot shows the data collected by the professor who wanted to see whether or not there is an association between her students’ heights and their IQ scores. She gave each of her students an IQ test and had her TA measure each student’s height to the nearest inch. Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.).
Solution
There appears to be no relationship between height and IQ scores for these students. The graph has no form and no direction. Because there is no pattern for a point to depart from, no points stand out as outliers. The relationship has essentially zero strength. There is no pattern or trend between IQ scores and students' heights.