Causal Inference in the Age of Big Data
Jasjeet Sekhon, Yale
Abstract:
The rise of massive datasets that provide fine-grained information about human beings and their behavior offers unprecedented opportunities for evaluating the effectiveness of social, behavioral, and medical treatments. With the availability of fine-grained data, researchers and policymakers are increasingly unsatisfied with estimates of average treatment effects based on experimental samples that are unrepresentative of populations of interest. Instead, they seek to target treatments to particular populations and subgroups. Because of these inferential challenges, Machine Learning (ML) is now being used for evaluating and predicting the effectiveness of interventions in a wide range of domains from technology firms to clinical medicine and election campaigns. However, there are a number of issues that arise with the use of ML for causal inference. For example, although ML and related statistical models are good for prediction, they are not designed to estimate causal effects. Instead, they focus on predicting observed outcomes. In this talk, a number of meta-algorithms are presented that can take advantage of any supervised learning method to estimate the Conditional Average Treatment Effect function. Also discussed are new theoretical results on confidence intervals and overlap in high-dimensional covariates and a new algorithm for optimal linear aggregation functions for tree-based estimators.
Bio:
Jasjeet Sekhon is the Eugene Meyer Professor of Statistics and Data Science and Professor of Political Science at Yale University. He is also the Head of Advanced Data Science at Bridgewater Associates. He has conducted research on causal inference, machine learning, experimental design, and has worked on applications across the social sciences, including political science, economics, and epidemiology. His current research focuses on developing interpretable and credible machine learning methods for estimating causal relationships. Before Yale, he was the Robson Professor of Statistics and Political Science at UC Berkeley. He also has extensive industry experience working with both technology and finance firms.
Summary:
We have much more data to describe world phenomena
New platforms for collecting data and making existing data available
New opportunities for insights
Major feature of big data: heterogeneity
Data covers a wider range of attributes, phenomena, dynamics
Can use to target interventions
Making inferences from heterogeneous data is very challenging
E.g. p-hacking: rank results by p-value
Major principal-agent problem
Agents use p-values to promote their ideas
Principals can be fooled because p-values depend on modeling assumptions (analyses need to be designed carefully)
ML vs Causal Learning
ML can create biased models if the training data is sampled in biased way
ML cannot infer directionality of causality
Reasoning about interventions implies an unobserved counterfactual (e.g., what if the user had seen a different ad?)
Justification of modeling approach:
Classic theory: propose stats model, infer based on this model
E.g. everyone uses the Normal distribution because it's easy to model and kind of works
ML: train/test loop
Opportunity: if we find the real distribution from which data is sampled then all statistical approaches become more powerful and can find more precise facts
Example:
They ran an experiment on the impact of emails that tell recipients whether their neighbors voted, to see how that affects the recipients' own voting behavior
Average Treatment Effect: +8% voting
Effect varies among people: larger for people who vote sometimes (not rarely or often)
Normal way to find this:
Project data onto various unit attributes, look for effects
Challenge: p-hacking, will find some projection that shows sensitivity to treatment just by chance
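The chance-projection problem can be seen in a small simulation (my own construction, not from the talk): outcomes are pure noise with no true subgroup effect anywhere, yet scanning many candidate attribute splits and ranking by p-value still tends to surface a "significant" split.

```python
# Illustrative p-hacking simulation: null data, many subgroup tests.
import math
import random

random.seed(0)
n, n_attrs = 200, 20
outcome = [random.gauss(0.0, 1.0) for _ in range(n)]
# Random binary attributes, independent of the outcome by construction.
attrs = [[random.random() < 0.5 for _ in range(n_attrs)] for _ in range(n)]

def two_sample_p(ys_a, ys_b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    ma = sum(ys_a) / len(ys_a)
    mb = sum(ys_b) / len(ys_b)
    va = sum((y - ma) ** 2 for y in ys_a) / (len(ys_a) - 1)
    vb = sum((y - mb) ** 2 for y in ys_b) / (len(ys_b) - 1)
    z = (ma - mb) / math.sqrt(va / len(ys_a) + vb / len(ys_b))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

p_values = []
for j in range(n_attrs):
    group_a = [outcome[i] for i in range(n) if attrs[i][j]]
    group_b = [outcome[i] for i in range(n) if not attrs[i][j]]
    p_values.append(two_sample_p(group_a, group_b))

# Ranking results by p-value makes the best split look like a finding,
# even though every split here is noise.
print(min(p_values))
```

With 20 independent tests at the 5% level, the chance of at least one spurious "discovery" is roughly 1 - 0.95^20 ≈ 64%.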
Approach
T-Learner:
Split data on treatment
Train separate model for treated group and control group
Estimate the effect as the difference between the two models' predictions for the same unit
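The T-learner steps above can be sketched as follows (my own toy implementation; the meta-algorithm accepts any supervised learner, so a per-covariate-bucket mean stands in for the base model here):

```python
# Toy T-learner: one model per treatment arm, then difference of predictions.

def fit_bucket_means(rows):
    """'Train' a trivial regressor: average outcome for each covariate value."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}

def t_learner_cate(data):
    """data: list of (covariate, treated_flag, outcome) triples.
    Split on treatment, fit one model per arm, subtract predictions."""
    mu1 = fit_bucket_means([(x, y) for x, t, y in data if t])
    mu0 = fit_bucket_means([(x, y) for x, t, y in data if not t])
    return {x: mu1[x] - mu0[x] for x in mu1 if x in mu0}

data = [("a", True, 1.0), ("a", False, 0.0),
        ("b", True, 3.0), ("b", False, 1.0)]
cate = t_learner_cate(data)  # per-covariate treatment effect estimates
```

Any regressor with fit/predict (trees, neural nets, etc.) can replace fit_bucket_means; the meta-algorithm itself does not change.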
S-Learner:
Don’t split; keep treatment status as an input attribute
Train model on treated and control together and let it use treatment attribute as it would
Compare predictions of model for same unit with treatment attribute true vs false
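A matching S-learner sketch (again a toy bucket-mean model standing in for any supervised learner): a single model sees treatment as just another feature, and we query it twice per unit.

```python
# Toy S-learner: one pooled model; treatment is an ordinary feature.

def s_learner_cate(data):
    """data: list of (covariate, treated_flag, outcome) triples."""
    sums, counts = {}, {}
    for x, t, y in data:
        key = (x, t)                      # covariate plus treatment feature
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    mu = {k: sums[k] / counts[k] for k in sums}
    # Predict for the same unit with treatment on vs. off.
    xs = {x for x, t, y in data}
    return {x: mu[(x, True)] - mu[(x, False)]
            for x in xs if (x, True) in mu and (x, False) in mu}

data = [("a", True, 2.0), ("a", False, 1.0),
        ("b", True, 1.0), ("b", False, 1.0)]
cate = s_learner_cate(data)
```

Note that a regularized base learner may shrink the treatment feature toward zero, which is one reason S- and T-learners can behave quite differently in practice.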
X-Learner (their contribution):
Observe that often treatment and control groups have vastly different sizes
This is problematic because the larger group may support learning fine-grained features that cannot be learned from the smaller one. Those features then masquerade as treatment effects under a plain T-learner or S-learner, since the two sets of predictions differ for reasons unrelated to treatment.
X-learner focuses on each sub-dataset (treatment and control) separately
Trains model for subset A
Uses model for each unit in subset B to predict the counterfactual outcome for that unit
How a control unit would behave if it were treated (using the model trained on treated units)
How a treated unit would behave if it were not treated (using model trained on control units)
Compare actual observation vs predicted counterfactual outcome for each unit
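The imputation stage of the X-learner can be sketched like this (my own simplification, reusing a toy bucket-mean learner in place of an arbitrary supervised model):

```python
# Toy X-learner stages 1-2: fit per-arm outcome models, then impute each
# unit's individual effect from its observed outcome and the other arm's
# predicted counterfactual.

def fit_bucket_means(rows):
    sums, counts = {}, {}
    for x, y in rows:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}

def x_learner_imputed_effects(data):
    """data: list of (covariate, treated_flag, outcome) triples.
    Returns imputed individual effects for treated and control units."""
    mu1 = fit_bucket_means([(x, y) for x, t, y in data if t])
    mu0 = fit_bucket_means([(x, y) for x, t, y in data if not t])
    # Treated unit: observed outcome minus predicted control counterfactual.
    d_treated = [(x, y - mu0[x]) for x, t, y in data if t and x in mu0]
    # Control unit: predicted treated counterfactual minus observed outcome.
    d_control = [(x, mu1[x] - y) for x, t, y in data if not t and x in mu1]
    return d_treated, d_control

data = [("a", True, 2.0), ("a", False, 1.0)]
d_treated, d_control = x_learner_imputed_effects(data)
```

A final stage (not shown) fits separate CATE models to d_treated and d_control and blends them, which is where the fusion step described next comes in.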
Can also fuse above:
To avoid bias from non-random treatment assignment, don’t weight by the actual treatment indicator; use the estimated probability that a unit is treated (a propensity score from another regression)
For each unit the treatment effect =
TreatmentProb(unit) * ModelTrainedOnControl(unit) +
(1-TreatmentProb(unit)) * ModelTrainedOnTreated(unit)
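The fusion formula above is a one-line weighted blend; a sketch (the three inputs are whatever regressions were fit upstream, and the names are mine):

```python
# Propensity-weighted combination of the two X-learner CATE models.

def combined_cate(g, model_trained_on_control, model_trained_on_treated, x):
    """g(x): estimated probability that unit x is treated (propensity score).
    model_trained_on_control: CATE model fit on control units' imputed effects.
    model_trained_on_treated: CATE model fit on treated units' imputed effects."""
    return (g(x) * model_trained_on_control(x)
            + (1.0 - g(x)) * model_trained_on_treated(x))

# Toy usage with constant models: propensity 0.25 gives 0.25*2.0 + 0.75*1.0.
tau = combined_cate(lambda x: 0.25, lambda x: 2.0, lambda x: 1.0, x=None)
```

The weighting leans on the model trained on the *larger* arm where the unit was unlikely to land in the other arm, which is exactly the imbalance problem the X-learner targets.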