Abstracts

Gaussian Process Mixtures for Estimating Heterogeneous Treatment Effects

Abbas Zaidi*, Duke University

This paper describes a Gaussian-process mixture model for heterogeneous treatment effect estimation using transformed outcomes. The approach we present is distinct from previous work in its modeling assumptions, the type of inference it supports, and the extensions it admits. Earlier approaches to transformed-outcome modeling have relied almost exclusively on off-the-shelf machine learning tools. We aim to demonstrate the improvement in inference from a correctly specified, yet still flexible, model that leverages the distribution implied by the transformed outcome. Furthermore, our model naturally extends to cases where the assignment mechanism is learned from the data, for applications to observational studies. Preliminary results on two types of simulated datasets are presented to demonstrate the proposed model's performance relative to the current literature.
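The transformed-outcome construction this line of work builds on can be illustrated with a short sketch (a generic simulation, not the authors' Gaussian-process model): for a binary treatment T with known propensity e, the transformed outcome Y* = Y(T − e)/(e(1 − e)) has conditional expectation equal to the treatment effect, so a regression of Y* on covariates recovers the heterogeneous effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e = 0.5                                   # known randomization probability
t = rng.binomial(1, e, size=n)
tau = 1.0 + 0.5 * x                       # heterogeneous treatment effect
y = x + tau * t + rng.normal(size=n)      # outcome model

# Transformed outcome: E[y_star | x] equals the effect tau(x).
y_star = y * (t - e) / (e * (1 - e))

# A plain linear regression of y_star on x recovers tau(x) = 1.0 + 0.5 x.
coef = np.polyfit(x, y_star, 1)           # [slope, intercept]
```

The transformed outcome is very noisy, which is why a model that exploits its implied distribution (as the abstract proposes) can improve on generic regression of Y* on covariates.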

The Effect of Bariatric Surgery on Health Care Utilization: A Synthetic Control Approach Using Bayesian Structural Time Series

Christoph Kurz*, Helmholtz Zentrum München

Surgical measures to combat obesity are very effective in terms of weight loss, recovery from diabetes, and improvement in cardiovascular risk factors. Using claims data from the largest health insurance provider in Germany, we estimated the causal effect of bariatric surgery on health care utilization in a time period from 2 years before up to 3 years after bariatric intervention. Because of the absence of a randomized control group, we employed a Bayesian structural time series model using synthetic controls. We observed a significant decrease in medication and physician expenditures after bariatric surgery, while hospital expenditures slightly increased in the post-intervention period. Overall, we found bariatric surgery to be cost-saving in the observation period when omitting the direct costs of the surgery itself.

Accounting for latent causes in causal inference

David Heckerman*, Not affiliated

Accounting for latent causes is perhaps the most difficult task in causal inference. We present a specific task in genomic analysis, whose solution looks deceptively simple at first glance. We see that the solution is anything but simple, thus highlighting just how challenging the general problem is.

A comparative study of counterfactual estimators

Thomas Nedelec*, Criteo; Vianney Perchet, ENS Cachan; Nicolas Le Roux, Google

We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved.
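The three estimators compared above have simple closed forms; a minimal sketch for off-policy evaluation of a target policy from logs of a different policy (a toy two-action setup, with hypothetical policies and rewards):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
p_log = np.array([0.8, 0.2])      # logging policy over two actions
p_tgt = np.array([0.3, 0.7])      # target policy to evaluate

a = rng.choice([0, 1], size=n, p=p_log)
r = rng.binomial(1, np.where(a == 1, 0.9, 0.1))   # action 1 is better

w = p_tgt[a] / p_log[a]           # importance weights

empirical_average = r.mean()              # ignores the policy mismatch
basic_is = (w * r).mean()                 # unbiased, but high variance
normalized_is = (w * r).sum() / w.sum()   # self-normalized variant

# True value of the target policy: 0.3 * 0.1 + 0.7 * 0.9 = 0.66.
```

The empirical average estimates the logging policy's value (0.26 here), not the target's; the two importance-sampling estimators are consistent for 0.66, with the normalized variant trading a small bias for lower variance.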

Improved Matching for Causal Inference with Text

Reagan Rose*, Harvard University; Luke Miratrix, Harvard University; Aaron Kaufman, Harvard University; Jason Anastasopoulos, University of Georgia

Matching is a tool used to facilitate estimation of treatment effects from observational data in the presence of confounding covariates. Though widely used, most applications of matching have been limited to settings where both the covariates and outcomes are well-defined, low-dimensional quantities. This article explores the problem of matching within a corpus of distinct groups of documents (treatment and control groups), where interest lies in estimating the treatment effect on an outcome associated with each document. In this setting, standard contrasts of outcomes between groups may be biased estimates of the effect of interest due to confounding by high-dimensional and latent features of the text such as topical content or overall sentiment. We present a framework for matching documents that can be used in studies where the covariates and/or the outcome of interest are defined by summary measures of text, which overcomes challenges that are unique to inference with text data.

Causal Inference and Machine Learning with Text Data

YiJyun Lin*, UNR

My proposal first reviews how machine learning has contributed to automated text-extraction methods and how it may be used to map the contours of conflict studies. Second, I discuss emerging machine learning approaches and techniques in terms of how they handle concept drift, and how this might contribute to causal inference in social science research. The proposal concentrates on the problem of concept drift because it is closely related to causal inference, a fundamental issue confronting social science research that draws on big-data analysis. Finally, the proposal identifies research gaps and provides a foundation for further research on machine learning with text as big data.

Counterfactual Learning for Machine Translation: Degeneracies and Solutions

Carolin Lawrence, Heidelberg University; Pratik Gajane, INRIA; Stefan Riezler*, Heidelberg University

Counterfactual learning is a natural scenario for improving web-based machine translation services by offline learning from feedback logged during user interactions. To avoid the risk of showing inferior translations to users, such scenarios mostly rely on exploration-free, deterministic logging policies. We analyze possible degeneracies of inverse and reweighted propensity scoring estimators, in both stochastic and deterministic settings, and relate them to recently proposed techniques for counterfactual learning under deterministic logging.

Obtaining Accurate Probabilistic Causal Inference by Post-Processing Calibration

Fattaneh Jabbari*, University of Pittsburgh; Mahdi Pakdaman Naeini, Harvard University; Greg Cooper, University of Pittsburgh

Discovery of an accurate causal Bayesian network structure from observational data can be useful in many areas of science. Often the discoveries are made under uncertainty, which can be expressed as probabilities. To guide the use of such discoveries, including directing further investigation, it is important that those probabilities be well-calibrated. In this paper, we introduce a novel framework to derive calibrated probabilities of causal relationships from observational data, which consists of three components: (1) a method for generating initial probability estimates of the edge types, (2) a small calibration training set of the causal relationships for which the truth status is known, and (3) a calibration method. We introduce a new calibration method based on a shallow neural network. Our experiments on simulated data indicate that the proposed approach improves the calibration of causal edge predictions. It also often improves the precision and recall of edge predictions.

Modeling Interference Via Symmetric Treatment Decomposition

Ilya Shpitser*, Johns Hopkins University; Eric Tchetgen Tchetgen, Harvard T.H. Chan School of Public Health; Ryan Andrews, Johns Hopkins University

Classical causal inference assumes a treatment meant for a given unit does not have an effect on other units. When this assumption is violated, new types of spillover causal effects arise, and causal inference becomes much more difficult. We develop a new approach to decomposing the spillover effect into direct and indirect components, extending the treatment-decomposition approach from mediation analysis to causal chain graph models. We show that when these components of the spillover effect are identified in these models, they have an identifying functional, which we call the symmetric mediation formula, that generalizes the mediation formula. We further show that, unlike assumptions in classical mediation analysis, an identifying assumption in our setting leads to restrictions on the observed data law, making the assumption empirically falsifiable. Finally, we discuss statistical inference for the components of the spillover effect in the special case of two interacting outcomes.

Semi-parametric Causal Sufficient Dimension Reduction Of High Dimensional Treatment

Razieh Nabi*, Johns Hopkins University; Ilya Shpitser, Johns Hopkins University

Cause-effect relationships are typically evaluated by comparing outcome responses to binary treatment values, representing cases and controls. However, in certain applications, treatments of interest are continuous and high dimensional. For example, in oncology, the causal relationship between the severity of radiation therapy, represented by a high-dimensional vector of radiation exposure values, and side effects is of clinical interest. In such circumstances, a more appropriate strategy for making interpretable causal inferences is to reduce the dimension of the treatment. If individual elements of a high-dimensional feature vector weakly affect the outcome, but the overall relationship between the two is strong, careless approaches to dimension reduction may not preserve this relationship. We use semi-parametric inference theory for structural models to give a general approach to causal dimension reduction of a high-dimensional treatment such that the effect of the treatment on the outcome is preserved.

Fair Inference on Outcomes

Razieh Nabi*, Johns Hopkins University; Ilya Shpitser, Johns Hopkins University

We consider the problem of fair statistical inference involving outcome variables. Examples include classification and regression problems, and estimating treatment effects in randomized trials or observational data. The issue of fairness arises in such problems where some covariates or treatments are “sensitive,” in the sense of having potential of creating discrimination. In this paper, we argue that the presence of discrimination can be formalized in a sensible way as the presence of an effect of a sensitive covariate on the outcome along certain causal pathways, a view which generalizes [1]. A fair outcome model can then be learned by solving a constrained optimization problem. We discuss a number of complications that arise in classical statistical inference due to this view and provide workarounds, based on recent work in causal and semi-parametric inference.

Personalizing Path-Specific Effects

Ilya Shpitser*, Johns Hopkins University; Sourjya Sarkar, Johns Hopkins University

In causal inference for personalized medicine, the goal is to map a given unit’s characteristics to a treatment tailored to maximize the expected outcome for that unit. Obtaining high-quality mappings of this type is the goal of the dynamic regime literature. Aside from the average causal effects, causal mechanisms are also of interest. A well-studied approach to mechanism analysis is establishing effects along a particular set of causal pathways, in the simplest case the direct and indirect effects. Estimating such effects is the subject of the mediation analysis literature. We consider how unit characteristics may be used to tailor treatment assignment that maximizes a path-specific effect. To solve our problem, we define counterfactuals associated with path-specific effects of a policy, give an identification algorithm for these counterfactuals, give a proof of completeness, and show how classification algorithms in machine learning may be used to find a high-quality policy.

Adversarial Balancing for Causal Inference

Pierre Thodoroff*, McGill; Tal El-Hay, IBM; Michal Ozery-Flato, IBM

Biases in observational data pose a major challenge to treatment effect estimation methods. An important technique that accounts for these biases is reweighting samples to minimize the discrepancy between treatment groups. In this paper, we propose to use the classification error as a measure of similarity between two given datasets. We present a new framework for causal inference that uses a bi-level optimization to alternately train a discriminator to minimize classification error and a balancing-weights generator to maximize this error. This approach borrows principles from generative adversarial networks (GANs) and, more generally, from likelihood-free inference, aiming to exploit the power of classifiers for discrepancy measure estimation. We validate our approach on a causal inference competition using standard classifiers. Our experiments demonstrate the robustness of this approach in a typical causal inference task.
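The core idea of using classification error as a discrepancy measure can be sketched as follows (a toy illustration with a scikit-learn classifier, not the authors' adversarial training procedure): a classifier is trained to distinguish samples from two groups; cross-validated accuracy near 0.5 indicates the groups are indistinguishable (balanced), while accuracy well above 0.5 indicates a discrepancy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 2000
treated = rng.normal(loc=0.8, size=(n, 2))   # covariates shifted by treatment
control = rng.normal(loc=0.0, size=(n, 2))

X = np.vstack([treated, control])
g = np.r_[np.ones(n), np.zeros(n)]           # group labels

# Classification accuracy as a discrepancy measure between the groups.
acc_shifted = cross_val_score(LogisticRegression(), X, g, cv=5).mean()

# For two samples from the same distribution, accuracy stays near chance.
X_same = np.vstack([rng.normal(size=(n, 2)), rng.normal(size=(n, 2))])
acc_same = cross_val_score(LogisticRegression(), X_same, g, cv=5).mean()
```

In the adversarial framework above, a weight generator would then reweight samples so that the best achievable accuracy is driven toward 0.5.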

Active Learning with Logged Bandit Feedback

Songbai Yan*, University of California San Diego; Kamalika Chaudhuri, University of California San Diego; Tara Javidi, University of California San Diego

We consider active learning with logged bandit feedback, where instances are drawn from an underlying distribution, and a logging policy is used to determine whether they should be assigned a binary label. Our goal is to learn a binary classifier that predicts labels with high accuracy on the entire population, not just the distribution of the logged data. Previous work addresses this problem either when only logged data is available, or purely in a randomized experimentation setting. In this work, we combine both approaches to provide an algorithm that uses logged data to bootstrap and inform experimentation, thus achieving the best of both worlds. Our work is inspired by a connection between randomized experimentation and active learning, and modifies existing disagreement-based active learning algorithms to exploit logged data.

Learning functional causal models with generative neural networks

Olivier Goudet*, INRIA; Diviyan Kalainathan, Université Paris-Sud; Michele Sebag; Philippe Caillou, Université Paris-Saclay; David Lopez-Paz, Facebook; Isabelle Guyon, Clopinet

We introduce a new approach to functional causal modeling from observational data. The approach, called Causal Generative Neural Networks (CGNN), leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables, by minimizing the Maximum Mean Discrepancy between generated and observed data. First, we apply CGNN to the problem of pairwise cause-effect inference. Second, CGNN is applied to the problem of identifying v-structures and conditional independences. Third, we apply CGNN to the problem of multivariate functional causal modeling with and without hidden confounders. On all three tasks, CGNN is extensively assessed on both artificial and real-world data with known ground truth, comparing favorably to the state-of-the-art. Finally, a socio-economic application of CGNN is proposed in order to discover causal relationships between quality of life at work and profitability of a company.

Machine Learning for Partial Identification: Example of Bracketed Data

Vira Semenova*, MIT

Partially identified models occur commonly in economic applications. A common problem in this literature is a regression problem with a bracketed (interval-censored) outcome variable Y, which creates a set-identified parameter of interest. Recent studies have considered only finite-dimensional linear regression in this context. To incorporate more complex controls into the problem, we consider a partially linear projection of Y onto the set of functions that are linear in the treatment/policy variables and nonlinear in the controls. We characterize the identified set for the linear component of this projection and propose an estimator of its support function. Our estimator converges at a parametric rate and has asymptotic normality properties. It may be useful for labor economics applications that involve bracketed salaries and rich, high-dimensional demographic data about the subjects of the study.

Racial Treatment Disparities after Machine Learning Surgical-Appropriateness Adjustment

Noah Hammarlund*, Indiana University

Significant differences in inpatient surgery rates between black and non-black patients suggest a racial treatment disparity. However, these rates need to be adjusted for patient surgical appropriateness to increase patient comparability. In this paper, I focus on the method of this appropriateness adjustment. Using data from the Nationwide Inpatient Sample, I show that machine learning variable-selection methods reveal surgery-appropriateness controls from diagnosis codes that decrease the standard adjusted treatment disparity for cases of acute myocardial infarction by up to 50%. A statistically and practically significant treatment disparity remains after adjusting for many predictive controls, providing further evidence of the racial treatment disparities' persistence. The proposed approach can be used in different contexts where empirical health adjustment is necessary to make patients more comparable.

Large-Scale Causal Learning

Benoit Rostykus*, Netflix; Tony Jebara, Netflix

We augment causal learning under instrumental variables (IVs) with computational improvements to allow it to scale to large datasets that are typical in modern machine learning. While traditional IV implementations involve linear algebraic operations that have cubic scaling in the dimensions of interest, we achieve linear scalability across 1) the number of samples, 2) the number of non-zero elements of the features, and 3) the number of non-zero elements of the instruments. This is achieved by reformulating IV projections as a single joint optimization over model parameters and performing what we call pairwise stochastic gradient descent, using an importance sampling scheme. The method is also straightforward to extend to non-linear settings through neural networks or kernel methods. We demonstrate dramatic computational improvements on a large-scale synthetic dataset as well as on a real-world dataset.

Learning Weighted Representations for Generalization Across Designs

Fredrik Johansson*, MIT; Nathan Kallus, Cornell Tech; Uri Shalit, NYU; David Sontag, NYU

Prediction under distributional shift is a common aspect of machine learning applications, two important examples of which are counterfactual estimation and domain adaptation. We pose both problems as special cases of prediction under a shift in design - a change in policy and domain. Popular methods for overcoming distributional shift are often heuristic or rely on assumptions that are rarely true in practice, such as having a well-specified model or knowing the policy that gave rise to the observed data. Other methods are hindered by their need for a pre-specified metric for comparing observations, or by poor asymptotic properties. In this work, we devise a family of algorithms to address these issues, by jointly learning a representation and a re-weighting of observed data in the induced representation. We show that our algorithms minimize an upper bound on the generalization error under design shift, and verify the effectiveness of this approach in causal effect estimation.

Accurate inference for adaptive linear models

Yash Deshpande*, Massachusetts Institute of Technology; Vasilis Syrgkanis, Microsoft Research; Lester Mackey, Microsoft Research; Matt Taddy, Microsoft

When data is collected non-adaptively, the ordinary least squares estimator is known to have an asymptotic Gaussian distribution. However, this asymptotic characterization can be a poor approximation of reality when the data is collected in an adaptive or correlated fashion, even when the linear model is accurate. We develop a general method for decorrelating linear regression estimators in such settings, yielding a decomposition into bias and variance where the bias is typically small. Under a martingale central limit theorem this allows us to construct confidence intervals in the usual fashion. Our methods and results are demonstrated in two common analysis scenarios: evaluation of data obtained through multi-armed bandit algorithms and autoregressive time series inference.

Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution

Judea Pearl*, University of California, Los Angeles

Current machine learning systems operate, almost exclusively, in a statistical, or model-blind mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference. To demonstrate the essential role of such models, I will present a summary of seven tasks which are beyond reach of current machine learning systems and which have been accomplished using the tools of causal inference.

Estimation Considerations in Contextual Bandits

Maria Dimakopoulou*, Stanford University; Susan Athey, Stanford University; Guido Imbens, Stanford University

Contextual bandit algorithms seek to learn a personalized treatment assignment policy, balancing exploration against exploitation. Although a number of algorithms have been proposed, there is little guidance available for applied researchers to select among various approaches. Motivated by the econometrics and statistics literatures on causal effect estimation, we study a new consideration in the exploration vs. exploitation framework: the way exploration is conducted in the present may contribute to the bias and variance of potential outcome model estimation in subsequent stages of learning. We leverage parametric and non-parametric estimation methods in order to propose new contextual bandit designs.

Unsupervised causal inference using maximum entropy autoencoder (EAE) networks

Julius Ramakers*, University of Duesseldorf

We study the problem of unsupervised causal inference by using a novel maximum entropy autoencoder (EAE). As an objective function, we aim for maximum output entropy as well as full data reconstruction in our end-to-end network architecture. Similar to nonlinear ICA methods, we can reconstruct two components which can be related to a directed acyclic graph that encodes the causal structure. This enables us to recover the true causal direction in an unsupervised manner. We show tests on toy experiments as well as state-of-the-art results on benchmark datasets.

A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits

Negar Hassanpour*, University of Alberta; Russell Greiner, U Alberta

We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use any given Randomized Controlled Trial (RCT) to generate a range of observational studies (with synthesized “outcome functions”) that can match the user’s specified degrees of sample selection bias, which can then be used to comprehensively assess a given learning method. This is especially important in developing methods for precision medicine where deploying a bad policy can have devastating effects. As the outcome function specifies the real-valued quality of any treatment for any instance, we can accurately compute the quality of any proposed treatment policy. This paper uses this evaluation methodology to establish a common ground for comparing the robustness and performance of the available off-policy learning methods in the literature.

Causal Inference via Kernel Deviance Measures

Jovana Mitrovic*, University of Oxford; Dino Sejdinovic, University of Oxford; Yee Whye Teh, University of Oxford

Identifying causal relationships among a set of variables is a fundamental problem in many areas of science. In this paper, we present a novel general-purpose causal inference method, Kernel Conditional Deviance for Causal Inference (KCDC), for inferring causal relationships from observational data. In particular, we propose a novel interpretation of the well-established notion of asymmetry between cause and effect. Based on this, we derive an asymmetry measure using the framework of representing conditional distributions in reproducing kernel Hilbert spaces thus providing the basis for causal discovery. We demonstrate the versatility and robustness of our method across several synthetic datasets. Furthermore, we test our method on the real-world benchmark dataset Tübingen Cause-Effect Pairs where it outperforms existing state-of-the-art methods.

Inference on Time Series: A Sequential, Non-Monotone Missingness Model

Eli Sherman*, Johns Hopkins University; Ilya Shpitser, Johns Hopkins University; Daniel Scharfstein, Johns Hopkins University

Missing data is one of the most ubiquitous problems in statistical analyses of healthcare data. A large body of work on missing data assumes monotonicity, where missingness at time t implies missingness at all subsequent times. The classical missing at random (MAR) model was extensively studied in this setting. Unfortunately, most missing data in practice is intermittent, and missing not at random, making the monotone MAR model inapplicable in practice. Existing models for non-monotone data have the unintuitive feature that missingness status at time t depends on times after t. We discuss a natural model for non-monotone data that is missing not at random and does not suffer from this problem. We demonstrate identification, discuss connections to the monotone MAR model, and consider maximum likelihood and semi-parametric inference schemes.

Learning item embeddings using biased feedback

Ashudeep Singh*, Cornell University; Thorsten Joachims, Cornell

Learning item embeddings from browsing logs of recommender systems provides intriguing opportunities for understanding user preferences. However, such log data can be severely biased because recommendations imply a selection bias on the number of clicks an item receives. This selection bias can lead to learned embeddings that are distorted by past recommendations and that do not reflect the true semantic similarity one would like to capture. To overcome this problem, we formulate the task of learning embeddings as a counterfactual learning problem: how would the user have clicked, if the recommendation algorithm had not interfered? To demonstrate effectiveness and promise of this approach, we present synthetic experiments that illustrate how the counterfactual learning approach can recover the true embeddings despite biased data.

Propensity Score Covariate Balancing with Gaussian Processes

Brian Vegetabile*, UC Irvine; Hal Stern, UC Irvine

The propensity score is an important tool for enabling causal inference in observational studies. Defined loosely, it is the probability of being treated given a set of pretreatment covariates. When modeled properly, the propensity score is a balancing score and adjustment on the propensity score provides unbiased inference under certain assumptions. This paper outlines a method of modeling the propensity score using a Gaussian process where the hyperparameters of the process are chosen to minimize covariate imbalance (under a defined metric). We show that the estimated propensity scores arising from our method produce results that are comparable to, and in some cases exceed, those that would be attained by adjustment on the true data-generating propensity score.
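A minimal sketch of propensity modeling with a Gaussian process, followed by a covariate-imbalance check (this uses scikit-learn's GaussianProcessClassifier with a generic RBF kernel as a stand-in; the paper's contribution of choosing the hyperparameters specifically to minimize imbalance is not reproduced here):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-1.5 * x[:, 0]))   # true propensity score
t = rng.binomial(1, p_true)

# Model the propensity score with a GP classifier (RBF kernel).
gp = GaussianProcessClassifier(kernel=RBF(length_scale=1.0)).fit(x, t)
e_hat = gp.predict_proba(x)[:, 1]

# Inverse-probability weights from the estimated propensity score.
w = np.where(t == 1, 1 / e_hat, 1 / (1 - e_hat))

def smd(v, t, w):
    """Weighted standardized mean difference: one imbalance metric."""
    m1 = np.average(v[t == 1], weights=w[t == 1])
    m0 = np.average(v[t == 0], weights=w[t == 0])
    return abs(m1 - m0) / v.std()

unweighted = smd(x[:, 0], t, np.ones(n))   # clearly imbalanced
weighted = smd(x[:, 0], t, w)              # much closer to zero
```

A well-estimated propensity score is a balancing score, so the weighted standardized mean difference should shrink substantially relative to the unweighted one.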

Double Boosting for High Dimensional IV Regression Models

Tae Hwy Lee*, University of California Riverside

Endogeneity in a regression model leads to inconsistent estimation. The standard solutions are two-stage least squares (2SLS) and the generalized method of moments (GMM). However, both methods face challenges with high-dimensional instruments, especially when some of the instruments are irrelevant and/or invalid. It is critical to select relevant and valid instruments for consistent estimation. In this paper, we introduce a new selection method, Double Boosting (DB), which consistently selects relevant and valid instruments simultaneously as the sample size n increases, even when dim(Z)≫n. Furthermore, we show that DB will not select weakly relevant or weakly valid instruments, with the extent of weakness defined in the sense of local-to-zero asymptotics. In estimating a BLP-type automobile demand function, where price is endogenous and the instruments are high-dimensional functions of product characteristics, we demonstrate the merits of the new DB procedure.

Orthogonal Machine Learning: Power and Limitations

Lester Mackey, Microsoft Research; Vasilis Syrgkanis*, Microsoft Research; Ilias Zadik, MIT

Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a $k$-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear model setting popular in causal inference, we use Stein's lemma to establish necessary and sufficient conditions for the existence of second-order orthogonal moments and demonstrate the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.
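The first-order Neyman-orthogonal moment at the heart of double machine learning can be sketched in the partially linear model (a generic residual-on-residual illustration, not the paper's higher-order construction; cubic polynomials stand in for the ML nuisance learners, and cross-fitting is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
theta = 2.0                                # true treatment effect
x = rng.normal(size=n)
t = np.sin(x) + rng.normal(size=n)         # treatment depends on x
y = theta * t + x**2 + rng.normal(size=n)  # partially linear outcome

# Estimate the nuisances E[Y|X] and E[T|X].
m_hat = np.polyval(np.polyfit(x, y, 3), x)
e_hat = np.polyval(np.polyfit(x, t, 3), x)

# Orthogonal (residual-on-residual) estimate of theta: first-order
# insensitive to errors in m_hat and e_hat.
rt = t - e_hat
theta_hat = np.sum(rt * (y - m_hat)) / np.sum(rt**2)
```

Because the moment depends on the nuisance errors only through their product, slow (e.g. n^{-1/4}) nuisance rates still yield a root-n-consistent theta_hat; the paper's k-th order orthogonality relaxes this requirement further.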

Causal learning through Bayesian classification

Karen Sachs*, MIT

In fields like molecular biology, representation of the underlying joint probability distribution can reveal underlying regulatory relationships and enable therapeutically actionable inferences, if the representation has a causal interpretation. Causal representations can be hard to achieve when interventional data is infeasible or unavailable. However, datasets are often annotated with a label that reflects states of the tissue of origin, such as disease state (‘cancer’ vs. ‘normal’). We capitalize on the state labels to build a Bayesian classifier, which has the useful effect of orienting edges in the learned network classifier. We discuss issues inherent to this approach and demonstrate results in human leukemia data.

Linear Models for Estimating Causal and Statistical Parameters given Missing Data

Karthika Mohan*, U C Berkeley; Judea Pearl, University of California, Los Angeles

We present novel techniques for handling missing datasets comprising continuous variables generated by linear Structural Equation Models. To this end, we define quasi-linear models for missingness and derive conditions under which the covariance matrix and causal effects can be consistently estimated. Specifically, we estimate the full covariance matrix using a procedure dictated by the graph structure. Finally, we present a general procedure for computing the mean and variance of any variable corrupted by missing values.

Bayesian Causality

Pierre Baldi*, University of California Irvine; Babak Shahbaba, University of California Irvine

Although no universally accepted definition of causality exists, in practice one is often faced with the question of statistically assessing causal relationships in different settings. We present a uniform general approach to causality problems derived from the axiomatic foundations of the Bayesian statistical framework. In this approach, causality statements are viewed as hypotheses, or models, about the world and the fundamental object to be computed is the posterior distribution of the causal hypotheses, given the data and the background knowledge. Computation of the posterior may involve complex probabilistic modeling but this is no different than in any other Bayesian modeling situation. The main advantage of the approach is its connection to the axiomatic foundations of the Bayesian framework, and the general uniformity with which it can be applied to a variety of causality settings, ranging from specific to general cases, or from causes of effects to effects of causes.