# STATISTICS AND RESEARCH METHODS

A course in statistics using spreadsheets and resampling, in the context of quantitative research methods and reasoning.

Some statistical techniques and research methods are widely used (in the life sciences, at least). Understanding the most common of them could be useful for students in many fields of science.

As part of an undergraduate course on statistics and research methods, I have created a set of 22 activities to guide students through the process of quantitative research. The activities provide a structured course format that may increase engagement and learning of the course material (Eddy and Hogan, 2017). The activities are organized into three main categories:

SECTION 1. WHY do we need probability and statistics to help us make decisions?

SECTION 2. WHAT are scientific models? How can data lead to scientific understanding?

SECTION 3. HOW can we design research to help make robust discoveries?

A syllabus that provides more detail about how the activities were implemented in the context of a course is shown below. An example of one of the activities (on normal distributions) is also provided below.

The activities are consistent with the Guidelines for Assessment and Instruction in Statistics Education (GAISE) recommendations to:

1. Teach statistical thinking.

   Teach statistics as an investigative process of problem-solving and decision making.

   Give students experience with multivariable thinking.

2. Focus on conceptual understanding.

3. Integrate real data with a context and purpose.

4. Foster active learning.

5. Use technology to explore concepts and analyze data.

6. Use assessments to improve and evaluate student learning.

Some of the main objectives of my approach to statistics and research methods are:

A) To encourage students to construct knowledge about statistics and research methods based on an understanding of the reasons and principles that have led to research practices. For example:

Understanding why statistics are necessary (because of cognitive biases, logical and practical constraints).

Understanding how statistics are based on probability and counting.

Understanding the importance of statistical and scientific MODELS for research.

Understanding how general techniques such as “normalization” and “variance accounted for” can contribute to both statistics and research methods.
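To make "normalization" and "variance accounted for" concrete, here is a minimal sketch in Python. The course itself uses spreadsheet functions, so the language, the function names, and the data here are all assumptions made for illustration. Normalization is shown as z-scores, and variance accounted for as the usual R-squared calculation.

```python
import statistics

def z_scores(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    # Population SD is an assumption here; the sample SD is also common.
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def variance_accounted_for(observed, predicted):
    """Fraction of the variance in `observed` explained by `predicted`."""
    mean = statistics.fmean(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical data, invented for illustration only.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
predicted = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical model: y = 2x, x = 1..5

normalized = z_scores(observed)
r_squared = variance_accounted_for(observed, predicted)
print("z-scores:", [round(z, 2) for z in normalized])
print("variance accounted for:", round(r_squared, 3))
```

In a spreadsheet, the same steps would be built from cell formulas; the point of the sketch is only that both ideas reduce to means, differences, and sums of squares.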

B) To encourage active learning using real data. A primary goal (particularly for online learning) is to make statistics and research methods a “hands-on” course. I use several methods to encourage active learning:

Encouraging students to learn how to use spreadsheets (Google Sheets, although Excel could be used for most activities) instead of powerful statistical software. In my estimation, knowing how to use spreadsheets (or some other programming language, spreadsheets being the most accessible) is a fundamental college skill. However, many students do not know how to use spreadsheets, and particularly how to use functions to perform calculations. Therefore, the activities are based around problem-solving with spreadsheet functions.

(I appreciate that using spreadsheets may seem anachronistic. Many other resources are structured around more powerful platforms, and languages such as R and Python are widely used in many areas of data science. However, I see several advantages of using spreadsheets. First, spreadsheets are easy to use and ubiquitous: they are designed for use by non-programmers, and they do not involve the abstraction of a text-based programming language, which is very challenging for many students. Second, the relative ease and simplicity of spreadsheets allow assignments to involve mathematical problem-solving without extensive programming training. For example, students can implement their OWN resampling models to test statistical hypotheses. Granted, the models implemented in spreadsheets are much less powerful than equivalent R functions, but students can understand how processes such as resampling work by implementing the functions themselves. Third, using spreadsheets and spreadsheet functions could help students learn programming later. Spreadsheets are a type of functional programming language, and contain many aspects common to most programming languages (data types, functions, assignment, conditionals, even some looping capability). Learning spreadsheets could help students learn some fundamentals of programming before trying to master the abstractions and challenges of text-based languages and development platforms. Finally, many (most of my) students do not plan on pursuing careers that require powerful statistical programming packages. On the other hand, learning to use spreadsheets is likely to be useful for most vocations. For all of these reasons and more, I maintain that spreadsheets are an appropriate choice for active learning of statistics/research methods concepts in many fields. We also now have lots of anecdotal evidence that asking our students to learn R and statistics during the same class is asking far, far too much.)

Introducing resampling ("bootstrapping") before introducing parametric statistics. The worksheets (and associated spreadsheets) teach students how to set up their own “experiments” using resampling to test statistical hypotheses. Using resampling allows students to actively construct their own sampling distributions, and visualize the processes that underlie statistical tests.

Illustrating course concepts with real data that are currently relevant. The activities draw many of their examples from the COVID-19 pandemic, and other examples from publicly available datasets such as the 500 Cities Project.
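For readers who think more easily in a text-based language, the resampling idea described above can be sketched in a few lines of Python. This is not the course's implementation (the activities use spreadsheets); the function names and the data are hypothetical. The sketch resamples an observed sample with replacement many times, collects the mean of each resample to build a sampling distribution, and reads a percentile interval off that distribution.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=10_000, seed=0):
    """Return the mean of each of n_resamples bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement from the observed sample.
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    return means

def percentile_interval(values, level=0.95):
    """Return the central `level` fraction of the values as (low, high)."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low = ordered[int(tail * len(ordered))]
    high = ordered[int((1 - tail) * len(ordered)) - 1]
    return low, high

# Hypothetical measurements, invented for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 5.6]

means = bootstrap_means(sample)
low, high = percentile_interval(means)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval for the mean: {low:.2f} to {high:.2f}")
```

A spreadsheet version would replace the loop with many rows of randomly indexed cells, but the logic (resample, summarize, repeat, then look at the distribution of the summaries) is the same.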

C) To integrate statistics into a broader context of research methods and scientific reasoning. Statistics is only one link in a chain of scientific reasoning. The activities place statistical methods within the larger context of reasoning and science. For example:

Understanding why science involves so much “negativity” (e.g. why null hypotheses must be rejected), and why logic requires the somewhat counterintuitive reasoning of rejecting null hypotheses.

Understanding how statistical hypotheses relate to research hypotheses (research hypotheses being both general scientific models and measurable predictions).

Analyzing the mathematics of basic statistics to discover that statistics is based on comprehensible principles of counting and algebra.
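As one illustration of how a statistical test can reduce to counting, the sketch below computes an exact one-sided p-value for a coin-flip experiment by listing every possible outcome under the null hypothesis of a fair coin. The scenario (10 flips, 9 heads) and the function name are hypothetical, chosen only for illustration, and Python stands in for the spreadsheets used in the course.

```python
from itertools import product

def exact_p_value(n_flips, observed_heads):
    """One-sided p-value under the null hypothesis of a fair coin.

    Counts the fraction of all equally likely flip sequences that
    contain at least `observed_heads` heads.
    """
    outcomes = list(product("HT", repeat=n_flips))  # all 2**n_flips sequences
    as_extreme = sum(1 for seq in outcomes if seq.count("H") >= observed_heads)
    return as_extreme / len(outcomes)

# Hypothetical experiment: 9 heads in 10 flips.
p = exact_p_value(10, 9)
print(p)  # 11 of the 1024 sequences have 9 or more heads: 0.0107421875
```

If a fair coin would so rarely produce a result this extreme, the null hypothesis of fairness becomes hard to maintain; that is the "negative" reasoning of rejecting a null hypothesis, carried out with nothing more than counting and division.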

I have tried to make the activities accessible and conversational. I have tried to incorporate extensive repetition of important concepts throughout the activities (in my experience, repetition is essential for learning). I have tried to structure the activities so that students “discover” many of the important concepts through their own problem solving.

Do they work? I have only anecdotal experience. My sense is that the class experience is challenging and intense, but the students gain an understanding of statistical and research concepts, and successfully learn how to set up and solve problems using spreadsheet functions. However, the course material is very challenging and, in my experience, requires the full 9 hours per week expected for a 3-unit course. Of course, everything is a work in progress. In the future, I hope to port the activities to a platform such as Jupyter Notebook, which could allow for direct assessment of effectiveness.

Another disclaimer. I am NOT a statistician. There may be errors in the activities (hopefully minor ones ;-). Some of the terminology is inconsistent and still needs to be standardized. Moreover, nothing has been copy-edited by anyone but me, so there are formatting inconsistencies and similar issues that need to be addressed. I am thankful to other authors such as Danielle Navarro for providing excellent open-access materials on statistics, which provided inspiration.

I have posted an example syllabus with links to all of the activities here: https://docs.google.com/spreadsheets/d/1Q94zsAP_dB9jIU1kXED5K1xlRX5-80oG4cRKJ9t79Nk/edit?usp=sharing.

All of the activities can also be found in the "Book Version" of Research Methods/Reasoned Writing (although the book version is not quite as up to date as the activities in the syllabus).

I am willing to share the entire course with almost anyone. However, I am not willing to share my materials with institutions that use discriminatory hiring or enrollment practices (e.g. institutions that require religious affirmations or other prejudiced policies). Your institution wants to be exclusive? You’ve excluded yourself, sorry. If your institution requires religious affirmations for employment or enrollment, or discriminates against other groups, please do NOT use any of the materials on this site.

An example of one of the course activities. Most weeks, students complete two activities, which are discussed during synchronous lecture/discussion periods.

A Table of Contents for the 22 activities (from the "Book Version") is:

SECTION 1: STATISTICAL RESEARCH METHODS

1) ESTIMATING PROBABILITIES

2) USING SPREADSHEETS

3) COGNITIVE BIASES

4) POPULATIONS, SAMPLES, AND RESAMPLING

5) PROBABILITY

6) CONDITIONAL PROBABILITY

7) REASONING

8) SCIENTIFIC MODELS AND PREDICTIONS

9) MEASUREMENTS

10) SAMPLES AND POPULATIONS

11) DESCRIPTIVE STATISTICS

12) FREQUENCY AND PROBABILITY DISTRIBUTIONS

13) HYPOTHESIS TESTING

14) CUMULATIVE DISTRIBUTION FUNCTIONS

15) THE NORMAL DISTRIBUTION

16) CONFIDENCE INTERVALS

17) Z TESTS AND T TESTS

18) “GOODNESS OF FIT” AND CHI SQUARE TESTS

19) LOGICAL FALLACIES AND HYPOTHESIS TESTING

20) CORRELATION AND REGRESSION

21) MULTIPLE COMPARISONS

22) RESEARCH DESIGN