Data Challenge

R Code for Generating Challenge Datasets is available for download (Aug 20, 2019).

Results of the Data Challenge Now Available (May 24, 2019)

  • 19 teams submitted results for 29 methods. Participants came from academia and industry worldwide.
  • Covariates were simulated or drawn from 7 source datasets spanning healthcare, business, and social domains. We gratefully acknowledge the funders and maintainers of data repositories at UC Irvine, Vanderbilt University, and Columbia University.
  • The true population ATE for each data file, slides presented at ACIC on May 24, 2019, and plots summarizing each method's performance are now available for download.
  • Updated high-dimensional results files include a revised PCATSv2 entry that fixed a bug in the originally reported results. Available for download (Aug 21, 2019).
  • (Although the Challenge is over, the datasets remain available below.)

The Challenge

Provide an estimate of the population average additive treatment effect (ATE) of a binary treatment on a binary or continuous outcome, along with a 95% confidence interval. There are 3200 low-dimensional datasets and 3200 high-dimensional datasets. Within each track, 100 datasets have been drawn from each of 32 unique data-generating processes (DGPs). Participants will download these datasets, run analyses using their own computing resources, and upload results to the website for evaluation. Teams may choose to analyze only the low-dimensional datasets, only the high-dimensional datasets, or submit results for both tracks.

Data Description

Covariates were drawn from publicly available datasets or simulated. Identifiability of the parameter is guaranteed; however, challenges to estimation have been built into the processes for generating the binary treatment assignment and the binary or continuous outcome. These include non-linearity of the response surface, treatment effect heterogeneity, a varying proportion of true confounders among the observed covariates, and near violations of the positivity assumption (a toy illustration follows the track list below). This year's challenge has two tracks:

Low-dimensional datasets (varying size, e.g., 500 x 20)

High-dimensional datasets (varying size, e.g., 1000 x 200, 2000 x 200)
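
For intuition only, the following R sketch shows a toy data-generating process with the kinds of difficulties listed above: a non-linear response surface, treatment effect heterogeneity, and near violations of positivity. It is not one of the 32 challenge DGPs, and every parameter value is arbitrary.

    # Toy DGP sketch -- NOT one of the actual challenge DGPs; all settings are illustrative.
    set.seed(1)
    n <- 500; p <- 20
    V <- matrix(rnorm(n * p), n, p)                 # continuous covariates only, for brevity
    colnames(V) <- paste0("V", 1:p)
    # Treatment assignment: strong dependence on V1 pushes propensities toward 0 and 1
    ps <- plogis(2.5 * V[, 1] - 1.0 * V[, 2]^2)
    A  <- rbinom(n, 1, ps)
    # Non-linear response surface, with a treatment effect that varies in V3
    EY0 <- sin(V[, 1]) + 0.5 * V[, 2]^2 + 0.25 * V[, 4]
    tau <- 1 + 0.8 * V[, 3]                         # subject-level treatment effect
    Y   <- EY0 + A * tau + rnorm(n)
    dat <- data.frame(Y = Y, A = A, V)
    mean(tau)                                       # sample analogue of the population ATE (here E[tau] = 1)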

Scientific Background

The causal parameter of interest is the population average additive treatment effect (ATE), E[Y(1) - Y(0)]. Causal assumptions of consistency and strong ignorability are guaranteed by the contest organizers; therefore the target statistical estimand equals E[E(Y | A = 1, V) - E(Y | A = 0, V)], where Y is the outcome, A is a binary treatment indicator, and V is a pre-treatment covariate vector containing all true confounders and possibly additional variables related to the outcome, the treatment assignment, or neither.
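
As one illustrative route to this estimand, the sketch below implements a simple plug-in (g-computation) estimator in R: regress Y on A and V, predict each subject's outcome with A set to 1 and to 0, and average the difference. The linear model is used only for brevity; it is almost certainly too rigid for the challenge DGPs, and the column names assume the test-data layout described further below.

    # Minimal g-computation sketch (illustrative; a linear outcome model is a strong
    # simplification -- flexible learners would be used in practice).
    estimate_ate_gcomp <- function(dat) {
      covars <- setdiff(names(dat), c("Y", "A"))
      fit <- lm(reformulate(c("A", covars), response = "Y"), data = dat)  # use glm() for a binary Y
      d1 <- dat; d1$A <- 1          # counterfactual dataset: everyone treated
      d0 <- dat; d0$A <- 0          # counterfactual dataset: no one treated
      mean(predict(fit, newdata = d1) - predict(fit, newdata = d0))
    }

    estimate_ate_gcomp(read.csv("testdataset1.csv"))  # filename follows the pattern described below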

Challenge Timelines

The Challenge Is Now Closed! Results will be announced at the Conference

(Afterwards, details about the DGPs and the true population ATEs will be provided on this website.)

  • December 1, 2018: A few sample datasets available for constructing your entry
  • January 14, 2019: Challenge Datasets now available below!
  • April 15, 2019: Deadline for results files to be uploaded (this has been corrected - it used to say April 8th)
  • May 22-24, 2019: Preliminary results announced at ACIC 2019 in Montreal, Canada
  • May 24, 2019: True population ATE for each data file and summary performance results available below


Test Datasets for Method Development

  • Files for the Data Challenge will be in the same form as the sample files named testdatasetX.csv, available for download below. Data are in the form (Y, A, V1, ... ,Vp), where
      • Y is the outcome (either binary or continuous)
      • A is a binary treatment indicator
      • V1 through Vp are a mix of continuous, binary, and categorical pre-treatment covariates (no mediators). The number of covariates, p, varies across datasets.
  • For method development we have also provided additional files containing the true treatment effect and the counterfactual outcomes for each observation. These files have the name testdatasetX_cf.csv. Data are in the form (ATE, EY1_i, EY0_i), where
      • ATE = the population average treatment effect (PATE), defined as E(EY(1) - EY(0)), where the expectation is with respect to the population distribution of covariates.
      • EY1_i = expected value of the counterfactual outcome under treatment for subject i
      • EY0_i = expected value of the counterfactual outcome under no treatment for subject i
  • Notes:
    1. The observed Y_i in the test data file is not identical to either EY1_i or EY0_i because of random error.
    2. The population ATE is not identical to the sample ATE. For the data challenge, bias will be assessed by averaging the estimates over the 100 replicates drawn from the same underlying DGP. (A short sketch using these files appears after this list.)
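
To make the file layout concrete, the sketch below reads one sample dataset together with its counterfactual companion file and compares the stored population ATE with the sample ATE and a naive difference in means. The column names, and the assumption that the ATE value is repeated on every row of the _cf file, are taken from the description above; adjust them if the downloaded files differ.

    # Sketch: inspect one sample dataset and its counterfactual companion file.
    obs <- read.csv("testdataset1.csv")     # columns: Y, A, V1, ..., Vp
    cf  <- read.csv("testdataset1_cf.csv")  # columns: ATE, EY1_i, EY0_i (names assumed from the description)

    pop_ate    <- cf$ATE[1]                                    # assumed constant across rows
    sample_ate <- mean(cf$EY1_i - cf$EY0_i)                    # note 2: differs from the population ATE
    naive_diff <- mean(obs$Y[obs$A == 1]) - mean(obs$Y[obs$A == 0])  # confounded comparison

    c(population = pop_ate, sample = sample_ate, naive = naive_diff)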

Datasets (registration not required for download)

Submit Results

  • Teams will submit results for each track separately. Results should be submitted as a .csv file.
    • Name your file(s) to include the track name, e.g., TeamName_low.csv or TeamName_high.csv
    • There must be one line per dataset (3200 lines total)
    • Each line should have four entries: the dataset name (e.g., low1), the estimated ATE, and the lower and upper bounds of the 95% confidence interval, in the format shown below (an assembly sketch in R follows the format line)

dataset_name, ATE, lb, ub
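
As a sketch of how a results file in this format might be assembled in R: the dataset file names (low1.csv, ...) and the estimate_one() helper below are placeholders for a team's own file layout and method, and the header row is omitted so that the file contains exactly one line per dataset.

    # Sketch: assemble a low-dimensional results file in the required format.
    # estimate_one() is a hypothetical helper standing in for a team's own method;
    # it is assumed to return a list with elements ate, lb, and ub for one dataset.
    files <- sprintf("low%d.csv", 1:3200)             # assumed naming pattern for the low-dimensional track
    rows <- lapply(files, function(f) {
      dat <- read.csv(f)
      est <- estimate_one(dat)
      data.frame(dataset_name = sub("\\.csv$", "", f),
                 ATE = est$ate, lb = est$lb, ub = est$ub)
    })
    out <- do.call(rbind, rows)
    write.table(out, "TeamName_low.csv", sep = ",",
                row.names = FALSE, col.names = FALSE, quote = FALSE)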

Entrants in the ACIC, 2019 Data Challenge retain ownership of all intellectual and industrial property rights (including moral rights) in and to submitted results and descriptive information (Submission). As a condition of submission, Entrant grants the Organizers a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive license to use, reproduce, adapt, modify, publish, distribute, publicly perform, create a derivative work from, and publicly display the Submission. Entrants will not be asked to submit code, but may be given the option to publish their code at a later date. Entrants may opt to remain publicly anonymous, but must provide a contact email that will be used by the Organizers only for communications regarding the Submission.