Causal Inference in the Age of Big Data
Jasjeet Sekhon, Yale
Abstract:
The rise of massive datasets that provide fine-grained information about human beings and their behavior offers unprecedented opportunities for evaluating the effectiveness of social, behavioral, and medical treatments. With the availability of fine-grained data, researchers and policymakers are increasingly unsatisfied with estimates of average treatment effects based on experimental samples that are unrepresentative of populations of interest. Instead, they seek to target treatments to particular populations and subgroups. Because of these inferential challenges, Machine Learning (ML) is now being used for evaluating and predicting the effectiveness of interventions in a wide range of domains from technology firms to clinical medicine and election campaigns. However, there are a number of issues that arise with the use of ML for causal inference. For example, although ML and related statistical models are good for prediction, they are not designed to estimate causal effects. Instead, they focus on predicting observed outcomes. In this talk, a number of meta-algorithms are presented that can take advantage of any supervised learning method to estimate the Conditional Average Treatment Effect function. Also discussed are new theoretical results on confidence intervals and overlap in high-dimensional covariates and a new algorithm for optimal linear aggregation functions for tree-based estimators.
Bio:
Jasjeet Sekhon is the Eugene Meyer Professor of Statistics and Data Science and Professor of Political Science at Yale University. He is also the Head of Advanced Data Science at Bridgewater Associates. He has conducted research on causal inference, machine learning, experimental design, and has worked on applications across the social sciences, including political science, economics, and epidemiology. His current research focuses on developing interpretable and credible machine learning methods for estimating causal relationships. Before Yale, he was the Robson Professor of Statistics and Political Science at UC Berkeley. He also has extensive industry experience working with both technology and finance firms.
Summary:
We have much more data to describe world phenomena
New platforms for collecting data and making existing data available
New opportunities for insights
Major feature of big data: heterogeneity
Data covers a wider range of attributes, phenomena, dynamics
Can use to target interventions
Making inferences from heterogeneous data is very challenging
E.g. p-hacking: rank results by p-value
Major principal-agent problem
Agents use p-values to promote their ideas
Principals can be fooled because p-values depend on modeling assumptions (analyses need to be designed carefully)
ML vs Causal Learning
ML can create biased models if the training data is sampled in biased way
ML cannot infer directionality of causality
Reasoning about interventions implies an unobserved counterfactual (e.g., what if the user had seen a different ad?)
Justification of modeling approach:
Classic theory: propose stats model, infer based on this model
E.g. everyone uses the Normal distribution because it's easy to model and kind of works
ML: train/test loop
Opportunity: if we find the real distribution from which data is sampled then all statistical approaches become more powerful and can find more precise facts
Example:
They ran an experiment on the impact of emails that tell recipients whether their neighbors voted, to see how that affects the recipients' own voting behavior
Average Treatment Effect: +8% voting
Effect varies among people: larger for people who vote sometimes (not rarely or often)
Normal way to find this:
Project data onto various unit attributes, look for effects
Challenge: p-hacking, will find some projection that shows sensitivity to treatment just by chance
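The chance-projection problem can be seen in a small simulation (my own construction, not from the talk): outcomes are pure noise with no true subgroup effect anywhere, yet scanning many candidate attribute splits and ranking by p-value still tends to surface a "significant" split.

```python
# Illustrative p-hacking simulation: null data, many subgroup tests.
import math
import random

random.seed(0)
n, n_attrs = 200, 20
outcome = [random.gauss(0.0, 1.0) for _ in range(n)]
# Random binary attributes, independent of the outcome by construction.
attrs = [[random.random() < 0.5 for _ in range(n_attrs)] for _ in range(n)]

def two_sample_p(ys_a, ys_b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    ma = sum(ys_a) / len(ys_a)
    mb = sum(ys_b) / len(ys_b)
    va = sum((y - ma) ** 2 for y in ys_a) / (len(ys_a) - 1)
    vb = sum((y - mb) ** 2 for y in ys_b) / (len(ys_b) - 1)
    z = (ma - mb) / math.sqrt(va / len(ys_a) + vb / len(ys_b))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

p_values = []
for j in range(n_attrs):
    group_a = [outcome[i] for i in range(n) if attrs[i][j]]
    group_b = [outcome[i] for i in range(n) if not attrs[i][j]]
    p_values.append(two_sample_p(group_a, group_b))

# Ranking results by p-value makes the best split look like a finding,
# even though every split here is noise.
print(min(p_values))
```

With 20 independent tests at the 5% level, the chance of at least one spurious "discovery" is roughly 1 - 0.95^20 ≈ 64%.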
Approach
T-Learner:
Split data on treatment
Train separate model for treated group and control group
Estimate the effect as the difference between the two models' predictions for the same unit
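The T-learner steps above can be sketched as follows (my own toy implementation; the meta-algorithm accepts any supervised learner, so a per-covariate-bucket mean stands in for the base model here):

```python
# Toy T-learner: one model per treatment arm, then difference of predictions.

def fit_bucket_means(rows):
    """'Train' a trivial regressor: average outcome for each covariate value."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}

def t_learner_cate(data):
    """data: list of (covariate, treated_flag, outcome) triples.
    Split on treatment, fit one model per arm, subtract predictions."""
    mu1 = fit_bucket_means([(x, y) for x, t, y in data if t])
    mu0 = fit_bucket_means([(x, y) for x, t, y in data if not t])
    return {x: mu1[x] - mu0[x] for x in mu1 if x in mu0}

data = [("a", True, 1.0), ("a", False, 0.0),
        ("b", True, 3.0), ("b", False, 1.0)]
cate = t_learner_cate(data)  # per-covariate treatment effect estimates
```

Any regressor with fit/predict (trees, neural nets, etc.) can replace fit_bucket_means; the meta-algorithm itself does not change.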
S-Learner:
Don’t split; keep treatment status as an input attribute
Train model on treated and control together and let it use treatment attribute as it would
Compare predictions of model for same unit with treatment attribute true vs false
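A matching S-learner sketch (again a toy bucket-mean model standing in for any supervised learner): a single model sees treatment as just another feature, and we query it twice per unit.

```python
# Toy S-learner: one pooled model; treatment is an ordinary feature.

def s_learner_cate(data):
    """data: list of (covariate, treated_flag, outcome) triples."""
    sums, counts = {}, {}
    for x, t, y in data:
        key = (x, t)                      # covariate plus treatment feature
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    mu = {k: sums[k] / counts[k] for k in sums}
    # Predict for the same unit with treatment on vs. off.
    xs = {x for x, t, y in data}
    return {x: mu[(x, True)] - mu[(x, False)]
            for x in xs if (x, True) in mu and (x, False) in mu}

data = [("a", True, 2.0), ("a", False, 1.0),
        ("b", True, 1.0), ("b", False, 1.0)]
cate = s_learner_cate(data)
```

Note that a regularized base learner may shrink the treatment feature toward zero, which is one reason S- and T-learners can behave quite differently in practice.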
X-Learner (their contribution):
Observe that often treatment and control groups have vastly different sizes
This is problematic because the larger group may support learning fine-grained features that cannot be learned from the smaller one. Those features then masquerade as treatment effects under a plain T-learner or S-learner, since the two sets of predictions differ for reasons unrelated to treatment.
X-learner focuses on each sub-dataset (treatment and control) separately
Trains model for subset A
Uses model for each unit in subset B to predict the counterfactual outcome for that unit
How a control unit would behave if it were treated (using the model trained on treated units)
How a treated unit would behave if it were not treated (using model trained on control units)
Compare actual observation vs predicted counterfactual outcome for each unit
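The imputation stage of the X-learner can be sketched like this (my own simplification, reusing a toy bucket-mean learner in place of an arbitrary supervised model):

```python
# Toy X-learner stages 1-2: fit per-arm outcome models, then impute each
# unit's individual effect from its observed outcome and the other arm's
# predicted counterfactual.

def fit_bucket_means(rows):
    sums, counts = {}, {}
    for x, y in rows:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}

def x_learner_imputed_effects(data):
    """data: list of (covariate, treated_flag, outcome) triples.
    Returns imputed individual effects for treated and control units."""
    mu1 = fit_bucket_means([(x, y) for x, t, y in data if t])
    mu0 = fit_bucket_means([(x, y) for x, t, y in data if not t])
    # Treated unit: observed outcome minus predicted control counterfactual.
    d_treated = [(x, y - mu0[x]) for x, t, y in data if t and x in mu0]
    # Control unit: predicted treated counterfactual minus observed outcome.
    d_control = [(x, mu1[x] - y) for x, t, y in data if not t and x in mu1]
    return d_treated, d_control

data = [("a", True, 2.0), ("a", False, 1.0)]
d_treated, d_control = x_learner_imputed_effects(data)
```

A final stage (not shown) fits separate CATE models to d_treated and d_control and blends them, which is where the fusion step described next comes in.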
Can also fuse above:
To avoid bias from non-random treatment assignment, don’t weight by the actual treatment indicator; use the estimated probability that a unit is treated (a propensity score from another regression)
For each unit the treatment effect =
TreatmentProb(unit) * ModelTrainedOnControl(unit) +
(1-TreatmentProb(unit)) * ModelTrainedOnTreated(unit)
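The fusion formula above is a one-line weighted blend; a sketch (the three inputs are whatever regressions were fit upstream, and the names are mine):

```python
# Propensity-weighted combination of the two X-learner CATE models.

def combined_cate(g, model_trained_on_control, model_trained_on_treated, x):
    """g(x): estimated probability that unit x is treated (propensity score).
    model_trained_on_control: CATE model fit on control units' imputed effects.
    model_trained_on_treated: CATE model fit on treated units' imputed effects."""
    return (g(x) * model_trained_on_control(x)
            + (1.0 - g(x)) * model_trained_on_treated(x))

# Toy usage with constant models: propensity 0.25 gives 0.25*2.0 + 0.75*1.0.
tau = combined_cate(lambda x: 0.25, lambda x: 2.0, lambda x: 1.0, x=None)
```

The weighting leans on the model trained on the *larger* arm where the unit was unlikely to land in the other arm, which is exactly the imbalance problem the X-learner targets.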