Tools for Planning & Analyzing Randomized Controlled Trials & A/B Tests
EDM 2024 – Half-Day Tutorial

ABSTRACT

With the rapid growth of educational data mining as a research field and the increasing interest in measuring the effectiveness of the tools we produce, the number of randomized educational experiments – and the amount of data gathered from them – has grown steadily in recent years, particularly with A/B testing on educational software platforms. In turn, there is an ever-present demand for robust and flexible estimation and analysis of treatment effects from these data. In this tutorial, we present and discuss several approaches to covariate adjustment that unbiasedly estimate individual treatment effects. These methods are easy to implement with off-the-shelf machine learning algorithms and have been shown to produce more precise effect estimates than the simple difference in means.

TUTORIAL OVERVIEW

Analyses of data from randomized controlled trials (RCTs) and A/B tests often rely on t-tests or regression models to estimate causal effects. However, those methods leave a lot of data, power, and science on the table. They neglect – or do not make full use of – the rich baseline data available for each student participating in the RCT, such as prior clickstream or administrative records; they make no use of covariates and learning outcomes from students who were not part of the RCT, even though large, rich auxiliary datasets are often available; and they estimate only overall average treatment effects, masking whatever between-student variation in effectiveness may be present.
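For concreteness, here is a minimal, self-contained R sketch of the two conventional analyses just described, applied to simulated RCT data; all variable names and numbers are illustrative:

```r
## Conventional analyses of a simulated RCT: a simple difference in
## means (t-test) and a covariate-adjusted regression. All variable
## names and numbers are illustrative.
set.seed(1)
n  <- 500
x  <- rnorm(n)                 # baseline covariate, e.g., a prior score
Tr <- rbinom(n, 1, 0.5)        # Bernoulli(1/2) treatment assignment
y  <- 0.2 * Tr + x + rnorm(n)  # simulated outcome; true effect = 0.2

t.test(y[Tr == 1], y[Tr == 0])  # difference in means
summary(lm(y ~ Tr + x))         # regression adjustment
```

Both analyses use only the covariates measured on the experimental sample itself, and both report a single overall effect.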

With these concerns in mind, this tutorial aims to teach participants how to use modern statistical approaches to effect estimation that leverage off-the-shelf machine learning algorithms and incorporate auxiliary datasets to estimate individual treatment effects without bias – but with much greater precision. These methods can detect average treatment effects that would otherwise be lost in the noise of simpler approaches, allow researchers to explore treatment effect heterogeneity, and even help plan better experiments to begin with!

This tutorial focuses on the Leave-One-Out Potential Outcomes (LOOP) estimator [1], along with its practical applications and extensions for estimating causal effects in RCTs and A/B tests. The statistical theory underlying some of these methods has been introduced to the EDM community in technical papers and presentations [2, 3]. In contrast, this tutorial will focus on their practical application, using a new open-source R library that we have developed. By the end of this tutorial, participants will be able to use these methods almost as effortlessly as running a simple linear regression! While some familiarity with statistics and causal inference methodology – specifically, the potential outcomes framework – will be helpful, it is not required, as we will provide an overview of all the necessary concepts. We therefore encourage anyone interested in causal inference in general, and its applications in EDM in particular, to attend and participate!
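To give a flavor of the method, the code below is a minimal from-scratch sketch of the LOOP point estimate described in [1], assuming Bernoulli(1/2) treatment assignment and using random forests to impute each unit's potential outcomes from the other n − 1 units. The function name `loop_estimate` is illustrative; it is not the API of our package.

```r
## A from-scratch sketch of the LOOP point estimate of [1] under
## Bernoulli(1/2) assignment -- NOT the API of our R package.
## For each unit i, both potential outcomes are imputed from models
## fit to the remaining n - 1 units, which keeps the estimate unbiased.
library(randomForest)

loop_estimate <- function(Y, Tr, X) {
  n     <- length(Y)
  m_hat <- numeric(n)
  for (i in seq_len(n)) {
    tr_idx <- setdiff(which(Tr == 1), i)
    ct_idx <- setdiff(which(Tr == 0), i)
    fit_t  <- randomForest(X[tr_idx, , drop = FALSE], Y[tr_idx], ntree = 100)
    fit_c  <- randomForest(X[ct_idx, , drop = FALSE], Y[ct_idx], ntree = 100)
    t_hat  <- predict(fit_t, X[i, , drop = FALSE])  # imputed treated outcome
    c_hat  <- predict(fit_c, X[i, , drop = FALSE])  # imputed control outcome
    m_hat[i] <- (t_hat + c_hat) / 2                 # since P(Tr = 1) = 1/2
  }
  # Design-based point estimate of the average treatment effect
  mean(2 * (2 * Tr - 1) * (Y - m_hat))
}
```

On the simulated data above, `loop_estimate(y, Tr, cbind(x))` targets the same average effect as the difference in means, typically with smaller randomization variance; our package implements this estimator and its extensions, so no hand-written loop is required.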

TUTORIAL GOALS

By the end of the tutorial, participants will be able to:

- explain the potential outcomes framework and the idea behind the LOOP estimator;
- estimate treatment effects from RCT and A/B-test data using covariate adjustment with off-the-shelf machine learning algorithms;
- incorporate auxiliary data on students outside the experiment to improve precision;
- explore treatment effect heterogeneity; and
- plan future experiments, including power analysis and sample-size selection using auxiliary data.

CONTENTS & MATERIALS 

We have developed an open-source package implementing these methods on the R data analysis platform, as well as a Shiny application – a graphical user interface for power analysis using auxiliary data. To make these easy to use during the tutorial, we will provide a downloadable Docker container that contains R, our software library (along with the libraries it depends on), and example datasets. By opening the Docker container on their laptops, participants can perform all the described analyses without installing R or downloading additional files. The Docker container runs on all major operating systems, including Windows, macOS, and Linux. We will also provide instructions for installing our library and downloading the datasets for participants who already have R installed on their computers. We will demonstrate and discuss each use case with an EDM-themed example dataset. More information and materials will be added here later!
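As a preview of the power-analysis use case, here is a rough sketch, in base R, of the idea behind it: a model fit to auxiliary students explains some share of outcome variance, which shrinks the residual standard deviation entering an ordinary power calculation. The numbers below are illustrative, and this is not the Shiny application's interface.

```r
## Illustrative sketch only: if a model trained on auxiliary data
## explains a share R2 of outcome variance, covariate adjustment
## shrinks the residual SD by a factor of sqrt(1 - R2), reducing the
## sample size needed to detect a given effect.
R2   <- 0.4    # assumed predictive R^2 estimated on auxiliary students
sd_y <- 1.0    # raw outcome standard deviation

power.t.test(delta = 0.2, sd = sd_y, power = 0.8)                 # unadjusted
power.t.test(delta = 0.2, sd = sd_y * sqrt(1 - R2), power = 0.8)  # adjusted
```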

TENTATIVE SCHEDULE

The tutorial is a half-day session and will take place on the afternoon of July 14th, 2024, from 1:30 to 5:00 pm ET. Please note that the following schedule is subject to change:

1:30–1:45 -- Part I: Conceptual Overview

1:45–3:00 -- Part II: Estimating Effects with RCT Data

3:00–3:30 -- Part III: Incorporating Auxiliary Data

3:30–3:45 -- Break

3:45–4:15 -- Part IV: Treatment Effect Heterogeneity

4:15–5:00 -- Part V: Planning Experiments

For virtual participation, please use the following Zoom link: http://wpi.zoom.us/my/asales  

For a more thorough overview of the current plan for the tutorial, please see our proposal document. Please note that all details are subject to change!

ORGANIZERS & PRESENTERS

Adam Sales is an Assistant Professor of Statistics and a member of the Learning Sciences and Technologies faculty at Worcester Polytechnic Institute. He works on incorporating log data from computer-based learning applications into causal models to better understand what works in education and why.

Johann Gagnon-Bartsch is an Associate Professor of Statistics at the University of Michigan. His research focuses on causal inference, machine learning, and nonparametric methods with applications in the biological and social sciences.

Duy Pham is a graduate student in the Department of Data Science at Worcester Polytechnic Institute, working with Adam Sales. He has led the development of treatment effect heterogeneity estimators using the LOOP estimator and auxiliary datasets.

Charlotte Mann is a PhD student in the Department of Statistics at the University of Michigan, working with Johann Gagnon-Bartsch. She has led the development of the R package for the LOOP estimator and the application of LOOP for pair-randomized experiments.

Jaylin Lowe is a PhD student in the Department of Statistics at the University of Michigan, working with Johann Gagnon-Bartsch. She is leading the development of the Shiny application and methods to estimate power and select sample sizes using auxiliary data.

All questions may be directed to Duy Pham at dmpham1@wpi.edu.

ACKNOWLEDGEMENT

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D210031. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

REFERENCES