Session II (May 15, 10:30am-12:00pm): Online Experimentation, organized by Jonathan Stallrich
Title: A/B tests with Unobserved Network Spillover: Design and Inference
Speaker: Jonathan Stallrich, NC State University
Abstract: A/B testing, or the online controlled experiment (OCE), is a ubiquitous technique for comparing the effects of two or more treatments in online settings. The classical randomized design approach to controlled experiments relies on the stable unit treatment value assumption (SUTVA), which states that the outcome of an individual is independent of the treatments assigned to other individuals. It is well known that SUTVA is violated on online platforms due to a phenomenon known as network interference: individuals are connected via a social network, and the treatment assigned to one individual can influence the outcomes of its neighbors. Over the last decade or so, the classical randomized design has been largely supplanted in A/B tests by network clustering-based designs that account for this phenomenon. We show that network clustering-based designs suffer from two problems. First, the network itself is often unobserved or challenging and expensive to measure, so network clustering cannot be implemented in the first place. Second, there are almost always lurking variables: unobserved user features that influence both user response and network formation. We demonstrate that the presence of lurking variables makes network clustering-based estimators biased. To address both problems, we propose a two-stage design and estimation technique called HODOR (Hold-Out Design for Online Randomized experiments). Remarkably, HODOR is based on the classical randomized design, albeit with a correction for network interference. We carry out a theoretical analysis of HODOR to prove its unbiasedness and establish its optimal configuration. We also develop a statistical inference framework based on HODOR for hypothesis tests and confidence intervals. Through simulation studies and real-world data examples, we compare the empirical performance of HODOR to the classical randomized and network clustering-based designs.
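The SUTVA violation the abstract describes can be illustrated with a minimal simulation. The sketch below (not the speaker's HODOR method; the network model, effect sizes, and variable names are all hypothetical) generates a random social network, assigns a Bernoulli(1/2) treatment as in the classical randomized design, and adds a linear spillover from treated neighbors. Because both arms see roughly the same fraction of treated neighbors, the naive difference-in-means estimate recovers only the direct effect and misses the spillover component of the global treatment effect:

```python
import random

random.seed(0)

n = 2000
p_edge = 0.005  # Erdos-Renyi-style random network (hypothetical structure)
neighbors = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p_edge:
            neighbors[i].append(j)
            neighbors[j].append(i)

direct, spill = 1.0, 0.5  # assumed direct and spillover effect sizes

def simulate(assignment):
    """Outcome = noise + direct effect + spillover from treated neighbors."""
    y = []
    for i in range(n):
        frac_treated = (sum(assignment[j] for j in neighbors[i]) / len(neighbors[i])
                        if neighbors[i] else 0.0)
        y.append(random.gauss(0, 1) + direct * assignment[i] + spill * frac_treated)
    return y

# Global treatment effect: everyone treated vs. everyone in control.
gte = direct + spill  # = 1.5 under this linear spillover model

# Classical Bernoulli(1/2) randomization + naive difference-in-means.
z = [random.random() < 0.5 for _ in range(n)]
y = simulate(z)
treated = [yi for yi, zi in zip(y, z) if zi]
control = [yi for yi, zi in zip(y, z) if not zi]
dim = sum(treated) / len(treated) - sum(control) / len(control)

print(f"true global effect: {gte:.2f}")
print(f"naive diff-in-means: {dim:.2f}")  # near `direct` alone; spillover cancels across arms
```

Because treatment assignment is independent of the network, the spillover term contributes about `spill / 2` to both arms and cancels in the difference, leaving the estimator biased for the global effect. Correcting for exactly this kind of unobserved spillover is the problem HODOR targets.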