Stats text 12

Textbook chapter 12 Open data exploration

The instructor should provide data of their choice for open data explorations. Optimally the data should relate both to the students and to current events.

12.1 Open data exploration questions to ask and seek answers to

Note that the list of questions above is appropriate for a student in an introductory statistics course who is exploring data in ways that demonstrate knowledge of basic statistical functions. A researcher with some knowledge of statistics will ask different questions. For a researcher looking to engage in effective statistical practice, the following guidelines were suggested by Kass, Caffo, Davidian, Meng, Yu, and Reid in their 2016 article Ten Simple Rules for Effective Statistical Practice:

In modern statistical practice pre-registration of the intended data analysis and methods is considered a necessary practice.

12.2 A variables analysis approach to open data exploration

Another way to tackle analysis of the data is to explore the number and nature of the variables being presented. How many variables? What level of measurement? In introductory statistics one is usually either exploring basic statistics, running correlations, or comparing means.

Data is often organized into tables. In statistics, columns are usually variables and rows are individual data values. This is not always the case, but in introductory statistics it almost always is. If there is a single data column, then there is one variable. If there are two data columns, then there are two variables. The variable name and the units, if any, are usually listed in the first row of the table.

What can be analyzed, what can be done, depends in part on how many variables are present and the level of measurement. The following chart is for ratio level data. Note that basic statistics can be calculated for any ratio level variable. Remember that columns are variables.
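As a minimal sketch of the basic statistics that can be calculated for a single ratio level column, the following uses Python's standard statistics module. The function name basic_stats and the data values are hypothetical, chosen only for illustration, and the particular set of statistics shown is an assumption about what a course would ask for.

```python
import statistics

def basic_stats(values):
    """Return common descriptive statistics for one ratio level column."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mode": statistics.multimode(values),   # list, since ties are possible
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),      # sample standard deviation (n - 1)
    }

# Hypothetical data: one column, therefore one variable
column = [2, 4, 4, 5, 7, 8]
stats = basic_stats(column)

for name, value in stats.items():
    print(name, value)
```

The sample standard deviation is used here on the assumption that the column is a sample rather than a whole population.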

There is a caveat in using this approach, one best captured by the article Ten Simple Rules for Effective Statistical Practice cited above:

While it is obvious that experiments generate data to answer scientific questions, inexperienced users of statistics tend to take for granted the link between data and scientific issues and, as a result, may jump directly to a technique based on data structure rather than scientific goal. For example, if the data were in a table, as for microarray gene expression data, they might look for a method by asking, “Which test should I use?” while a more experienced person would, instead, start with the underlying question, such as, “Where are the differentiated genes?” and, from there, would consider multiple ways the data might provide answers. Perhaps a formal statistical test would be useful, but other approaches might be applied as alternatives, such as heat maps or clustering techniques.

With that in mind, for the student in an introductory statistics course where the objective is to practice statistical operations, a data structures approach is arguably appropriate. The data structures do sometimes provide information on what can be done with the data.

12.2 Single variable statistics: what you can report

12.2 Two column paired dependent data: different variables

If the data is in two columns and the variables are different, then the data is often paired dependent data. For this data, the slope, intercept, correlation, and strength of the relationship are likely to be the key statistics to report and discuss. A scatter graph should be reported as well. Here the diameter of a hula hoop is correlated with the period of rotation for the hula hoop.
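The slope, intercept, and correlation for two paired columns can be sketched with the usual least squares formulas. The function name linear_fit is hypothetical, and the diameter and period values are invented for illustration; they are not the hula hoop measurements from the original table.

```python
import math

def linear_fit(x, y):
    """Least squares slope, intercept, and Pearson correlation r for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Invented values, for illustration only
diameter_cm = [60, 70, 80, 90, 100]
period_s = [0.50, 0.56, 0.59, 0.66, 0.70]

slope, intercept, r = linear_fit(diameter_cm, period_s)
print("slope:", slope)
print("intercept:", intercept)
print("r:", r, "r squared:", r ** 2)
```

The square of r is the coefficient of determination, which is one way to describe the strength of the relationship.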

12.2 Two column paired dependent data: same variables, dependent samples

If the data is in two columns and the variables are the same variables, then a scatter graph may not be appropriate. The data on the right is the same variable and is a before and after data set. This tells you that you are probably going to be looking for a difference in the sample means between two paired dependent samples. For this data the statistics to report would include the two means and a p-value from a t-test for a difference in paired dependent sample means. A simple column chart of the two means would help tell the story of whether there is a difference in the means.
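A paired t-test works on the row-by-row differences between the two columns. The sketch below uses only the Python standard library, so the two-tailed p-value is obtained by numerically integrating the Student's t density with Simpson's rule; this is an illustrative substitute for a spreadsheet or statistics package function, not the method used in the course. The function names and the before and after values are hypothetical.

```python
import math
import statistics

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for Student's t via Simpson's rule (steps must be even)."""
    # Log of the normalizing constant of the t density
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_area = 2 * (h / 3) * total   # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)

def paired_t_test(before, after):
    """Return (mean difference, t statistic, two-tailed p-value)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)  # assumes the differences vary
    t = mean_diff / std_error
    return mean_diff, t, t_two_tailed_p(t, n - 1)

# Invented before and after values, for illustration only
before = [72, 75, 70, 68, 80, 77]
after = [70, 74, 69, 65, 78, 76]

mean_diff, t_stat, p_value = paired_t_test(before, after)
print("mean difference:", mean_diff)
print("t:", t_stat, "p-value:", p_value)
```

A p-value below the chosen alpha, commonly 0.05, would indicate a significant difference between the before and after means.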

12.2 Two column independent data: same variables, independent samples, equal sample sizes

If the data is in two columns and the variables are the same variables, then the data might not be paired dependent data. You will have to read the data description carefully. In the example on the right, the data is not paired and not correlated. This is the change in heart rate for females and males as a result of exercise. For this data the appropriate statistics to report are the two means, perhaps a simple column chart of the mean for each column, and a p-value from a t-test for independent samples to determine if there is a difference in the population means for the two data sets. If the difference is significant, then the effect size should also be reported.
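One common measure of effect size for a difference in two means is Cohen's d, the difference in the sample means divided by a pooled standard deviation. The sketch below assumes that this is the effect size intended; the function name cohens_d and the heart rate values are hypothetical.

```python
import math
import statistics

def cohens_d(sample_a, sample_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)   # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd

# Invented changes in heart rate, for illustration only
females = [30, 34, 38, 42, 46]
males = [24, 28, 32, 36, 40]

d = cohens_d(females, males)
print("Cohen's d:", d)
```

By Cohen's conventions, values of d near 0.2, 0.5, and 0.8 are often described as small, medium, and large effects respectively.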

12.2 Two column independent data: same variables, independent samples, unequal sample sizes

If the data is in two columns of unequal sample sizes and the variables are the same variables, then the data cannot possibly be paired data. For this data the appropriate statistics to report are the two means, perhaps a simple column chart of the mean for each column, and a p-value from a t-test for two independent samples to determine if there is a difference in the population means for the two data sets. If the difference is significant, then the effect size should also be reported.
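For independent samples of unequal size, one widely used form of the test is Welch's t-test, which does not assume equal variances. The sketch below assumes that form; it computes the t statistic and the Welch-Satterthwaite degrees of freedom, from which a p-value would be read off the t distribution. The function name welch_t and the data values are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each sample mean
    vm_a = statistics.variance(sample_a) / n_a
    vm_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(vm_a + vm_b)
    # Welch-Satterthwaite approximation; usually not a whole number
    df = (vm_a + vm_b) ** 2 / (vm_a ** 2 / (n_a - 1) + vm_b ** 2 / (n_b - 1))
    return t, df

# Invented samples of unequal size, for illustration only
group_a = [30, 34, 38, 42, 46]
group_b = [24, 28, 32, 36, 40, 32]

t_stat, df = welch_t(group_a, group_b)
print("t:", t_stat, "degrees of freedom:", df)
```

The same function also works when the sample sizes happen to be equal, since nothing in the calculation pairs the rows.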

12.2 Multi-column data where the variables are the same

We met this example earlier. The variable for each column is the same: the number of flights per month for eight months, separated by airport and by whether the flight was domestic or international.

The sample means for each column can help you answer that question. The answer also provides information on the appropriate marketing focus for each airport.

A column chart of means can help tell the story of the data. 

Determining whether differences between the columns are significant would require constructing four 95% confidence intervals, one for each column. If the confidence intervals do not overlap, then the sample means are significantly different.
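The interval construction and overlap check can be sketched as follows. The critical values in T_CRIT_95 are standard two-tailed 95% Student's t values; the function names and the three columns of data are hypothetical and are not the airport flight counts from the example.

```python
import math
import statistics

# Two-tailed 95% critical values of Student's t, keyed by degrees of freedom
T_CRIT_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def confidence_interval_95(sample):
    """Return (lower, upper) bounds of the 95% confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    margin = T_CRIT_95[n - 1] * statistics.stdev(sample) / math.sqrt(n)
    return mean - margin, mean + margin

def intervals_overlap(ci_a, ci_b):
    """True if the two intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Invented columns of eight monthly values each, for illustration only
col_a = [100, 102, 98, 101, 99, 103, 97, 100]
col_b = [110, 112, 108, 111, 109, 113, 107, 110]
col_c = [101, 103, 99, 102, 100, 104, 98, 101]

ci_a = confidence_interval_95(col_a)
ci_b = confidence_interval_95(col_b)
ci_c = confidence_interval_95(col_c)

print("a and b overlap:", intervals_overlap(ci_a, ci_b))
print("a and c overlap:", intervals_overlap(ci_a, ci_c))
```

With eight values per column there are seven degrees of freedom, so the critical value used is 2.365.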

Use of a "candle stick" chart to depict 95% confidence intervals. Non-overlapping intervals are indicative of statistically significantly different sample means. 

Here a closer look at the O'Hare International and Narita Domestic flights per month shows that the 95% confidence interval for O'Hare overlaps the Narita sample mean. The O'Hare population mean could be the same as the Narita sample mean, so these means are not significantly different. The p-value between these two sets of data is 0.14, which is not surprising: there is no significant difference.

12.2 Multi-column data where the variables are different

Multi-column data where the variables are different is the most complicated case to handle in an introductory statistics course. Here a quantity called NikeFuel is based on only one of three variables: distance, pace, or duration.

Which variable is NikeFuel based on? 

The solution is to look for the correlation r between NikeFuel and distance, NikeFuel and pace, and NikeFuel and duration. The variable with the strongest correlation (the largest absolute value of r) is the one that NikeFuel is based on. The chart for this data would be a scatter graph with NikeFuel on the x-axis.

When the variables are different, the analysis is more likely to be an analysis of correlations to the first column. In this situation the means will obviously be different, because the variables are not the same, and thus the means do not provide insight into the data.
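The search for the most strongly correlated column can be sketched as a loop over the candidate variables. The function names are hypothetical and the run data is invented so that the NikeFuel column is exactly proportional to distance; these are not the values from the original table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r for two equal-length columns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def strongest_correlate(target, candidates):
    """Return the candidate name with the largest |r|, plus all correlations."""
    correlations = {name: pearson_r(values, target)
                    for name, values in candidates.items()}
    best = max(correlations, key=lambda name: abs(correlations[name]))
    return best, correlations

# Invented run data, for illustration only
nikefuel = [300, 600, 900, 1200, 1500]
candidates = {
    "distance": [1, 2, 3, 4, 5],
    "pace": [6.0, 5.5, 6.2, 5.8, 6.1],
    "duration": [6, 11, 19, 23, 31],
}

best, correlations = strongest_correlate(nikefuel, candidates)
for name, r in correlations.items():
    print(name, "r =", r)
print("strongest correlation:", best)
```

The absolute value is used in the comparison because a strong negative correlation is as informative as a strong positive one.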

Here the R² for NikeFuel against distance is 1.0 and thus the correlation is 1.0. The NikeFuel values are being calculated only from the distance of the run, not the pace and thus not the exertion (as asserted by Nike). Nike would later abandon reporting the NikeFuel system.

12.3 Consider what you know how to do…

A third way to tackle open data exploration in an introductory statistics course is to consider the statistical tools with which one has learned to work during the course. One can be 95% confident that the instructor has chosen a problem that can be resolved by the tools taught in the course. In the "wild" there are many more tools to consider: F-tests for a difference of variances (standard deviations), confidence intervals for a slope, tests for differences of medians, and tests for normality. All of these are beyond the scope of this particular course. Thus the beginning student is left with basic statistics (chapters one, two, three), correlations (chapter four), confidence intervals (chapter nine), hypothesis tests against a known mean (chapter ten), and tests for a difference in two sample means (chapter eleven). Those are the tools that have been covered, so in this course an open data exploration exercise is likely to utilize those same tools. This is not an approach a statistician would take, but it is one that is appropriate to a student in a first contact statistics course for non-majors.