FAQ

1. Challenge Timelines
  • December 1, 2018: A few sample datasets available for constructing your entry
  • January 14, 2019: Challenge Datasets are available
  • April 15, 2019: Deadline for results files to be uploaded (updated from April 8, 2019!)
  • May 22-24, 2019: Preliminary results announced at ACIC 2019 in Montreal, Canada
  • May 24, 2019: True population ATE for all data files and summary results available on the Data Challenge page

2. Does each team have to analyze all the datasets in both tracks?

No. Teams can choose to analyze the datasets in only the low-dimensional track, or only the high-dimensional track. Teams that analyze datasets in both tracks should upload two separate results files.

3. What are the exact evaluation criteria for this challenge? [posted January 11, 2019]

We will report bias, MSE, CI width, and CI coverage for each DGP, and also summarize these metrics over relevant features of the DGPs. For more background we recommend a paper describing the 2016 ACIC Data Challenge titled Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition, by Dorie et al. (2018). The article, six commentaries, and a rejoinder from the authors are available at https://www.imstat.org/journals-and-publications/statistical-science/statistical-science-future-papers/ .
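
For concreteness, here is a minimal sketch in Python of how these four metrics could be tabulated for a single DGP; the function and argument names are our own, and this is not the organizers' actual scoring code:

    import numpy as np

    def summarize_dgp(estimates, ci_lower, ci_upper, true_pate):
        # estimates, ci_lower, ci_upper: values submitted for the 100 datasets from one DGP
        # true_pate: the true population ATE for that DGP (known only to the organizers)
        estimates = np.asarray(estimates, dtype=float)
        ci_lower = np.asarray(ci_lower, dtype=float)
        ci_upper = np.asarray(ci_upper, dtype=float)
        bias = estimates.mean() - true_pate                     # average estimate minus truth
        mse = np.mean((estimates - true_pate) ** 2)             # mean squared error
        ci_width = np.mean(ci_upper - ci_lower)                 # average CI width
        coverage = np.mean((ci_lower <= true_pate) & (true_pate <= ci_upper))  # CI coverage
        return {"bias": bias, "MSE": mse, "CI width": ci_width, "coverage": coverage}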

4. If the solution approach is to estimate EY1_i and EY0_i, then the best we can do is to report the sample average treatment effect (SATE). However, the desired outcome of the challenge is the population average treatment effect (PATE), which is not identical to SATE. Is the difference between these two referred to as bias in the description? Do estimates refer to SATEs? [posted January 11, 2019]

The PATE equals the expected value of the SATE. For each DGP we will calculate bias as the difference between the true PATE (known to the organizers) and the average of the SATE estimates from the 100 datasets drawn from that DGP.
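
In symbols (our notation, not taken from the Challenge materials), with i indexing the n individuals in one dataset and j = 1, ..., 100 indexing the replicate datasets drawn from one DGP:

    \[
      \mathrm{SATE}_j = \frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{EY1}_i - \mathrm{EY0}_i\right),
      \qquad
      \mathrm{PATE} = E\!\left[\mathrm{SATE}_j\right],
      \qquad
      \mathrm{bias} = \frac{1}{100}\sum_{j=1}^{100}\widehat{\mathrm{ATE}}_j - \mathrm{PATE},
    \]

where the hatted ATE_j is the estimate submitted for dataset j; up to sign convention, the last expression is the difference between the true PATE and the average of the estimates.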

5. Will the method with the smallest bias be the winner? [posted January 11, 2019]

No. There may not be a single approach that is best over all DGPs, and bias is not the only relevant metric. For example, a method that exhibits small bias may have high variance, and therefore a higher overall MSE. Or it may produce confidence intervals that are either far too narrow (low coverage) or so wide that coverage is near 1. Either of those muddies the subject-matter interpretation of the result, and is worrisome for hypothesis testing.

6. What's the point, if there probably won't be one big winner? [posted January 11, 2019]

In past years certain approaches have stood out from the pack. It will be interesting to see if that continues this year. On a more serious note, this Data Challenge plays two important roles. First, it provides a platform for comparing the blinded performance of alternative estimators across a variety of DGPs. Second, it helps the Causal Inference community gain insight into characteristics of DGPs and methodologies that may affect (or help predict!) performance. This knowledge can guide practitioners choosing among the available methods, and inform theorists working to improve the existing state of the art.

7. How do I register a team for the Data Challenge? I did not find anywhere on the site to register a team. [posted January 15, 2019]

Teams do not need to register to download the datasets. We will ask for name, contact info, and a brief description of the method when teams upload their results.

8. Why did you provide test sets for the Data Challenge? Are they only for constructing the entry? [posted January 15, 2019]

Yes. We provided test datasets where the true ATE and EY0_i, EY1_i are known so that participants can verify that their programs are reading the data files and processing them correctly.

9. Is there a possibility for covariates in the dataset to be mediators? That is, a dataset construction such that A -> V1 -> Y, and the correct thing to do is exclude V1? [posted January 16, 2019]

No. Variables V1, ..., Vp in the datasets are all pre-treatment covariates. There are no mediators. (This is a good idea for future Data Challenges!)

10. Could there be latent confounders, i.e., confounders that influence ATE but aren't observed in the training or test data? [posted January 18, 2019]

No. If there were unmeasured confounders, the parameter wouldn't be identifiable from the data.
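
As background (standard causal notation, not taken from the Challenge materials): under consistency, positivity, and no unmeasured confounding given the observed covariates V, the ATE is identified by the g-formula

    \[
      \mathrm{ATE} = E\left[Y^{1} - Y^{0}\right]
                   = E\bigl[E(Y \mid A = 1, V)\bigr] - E\bigl[E(Y \mid A = 0, V)\bigr],
    \]

whose right-hand side depends only on the observed-data distribution. If a confounder were missing from V, that equality would generally fail, which is why the Challenge DGPs include every confounder among the observed covariates.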

11. a. Are there identifiers indicating whether the CSV data file columns are discrete or categorical, as opposed to incidentally having integer values? [posted January 21, 2019]

No.

b. I would assume it is fair to know which columns are discrete vs. which are continuous, for the future, because that marginal difference between two engineered solutions will have a significant impact on estimation/learning, especially if one were to consider normalization. [posted January 21, 2019]

When analyzing real-world datasets one might use domain knowledge to tailor the analysis. Here we don’t have that knowledge. This is discussed in some of the Commentaries and the Rejoinder to the Dorie et al. (2018) article (see FAQ #3).

For this challenge the functional form isn’t known, and could vary from one DGP to the next. For example, suppose some variable, V1, is a real number in the dataset. In some DGP we could have generated Y = A + V1 + epsilon, while in another it could be Y = A + I(V1 < 0) + epsilon (where epsilon is mean 0 random error). Or, V1 and V2 could be discrete but enter the model as a ratio. A real-life example is height recorded in meters and weight in kg, but treatment is assigned as a function of BMI = kg / m^2.

Teams can assume that transformations like these were incorporated into at least some of the DGPs.
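
As a purely illustrative sketch (not one of the actual Challenge DGPs; every variable name and coefficient below is invented), a simulation along these lines might look like:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    # Hypothetical covariates: V1 = height in meters, V2 = weight in kg, V3 = another covariate
    v1 = rng.normal(1.7, 0.1, n)
    v2 = rng.normal(75.0, 12.0, n)
    v3 = rng.normal(0.0, 1.0, n)

    # Treatment is assigned through a derived quantity (BMI = kg / m^2), not V1 or V2 directly
    bmi = v2 / v1 ** 2
    a = rng.binomial(1, 1.0 / (1.0 + np.exp(-(bmi - 25.0) / 2.0)))

    # The outcome depends on an indicator transformation of V3 rather than on V3 linearly
    epsilon = rng.normal(0.0, 1.0, n)          # mean 0 random error
    y = a + (v3 < 0).astype(float) + epsilon

    # Analysts only see (Y, A, V1, V2, V3); the transformations above are unknown to them.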

12. Will performance be evaluated separately for the binary and continuous type of outcomes? [posted January 26, 2019]

We will report overall performance, and also performance metrics within subgroups defined by certain characteristics of the DGPs. We haven’t worked out all the details yet, but binary vs. continuous outcomes will be one of the ways we break down the results.

13. Do you require the same approach to be applied to both types of outcomes? [posted January 26, 2019]

No. We’ll have no way of knowing how you analyze the data. If you want to use one method when the outcome is binary and another when the outcome is continuous, one could even view that as a single algorithm that data-adaptively chooses whether to apply Approach 1 or Approach 2.
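
A minimal sketch of that view in Python, where binary_outcome_method and continuous_outcome_method are placeholder callables supplied by the team:

    import numpy as np

    def estimate_ate(y, a, covariates, binary_outcome_method, continuous_outcome_method):
        # One "algorithm" that data-adaptively picks which of two approaches to apply.
        y = np.asarray(y)
        is_binary = set(np.unique(y)).issubset({0, 1})   # crude check for a binary outcome
        chosen = binary_outcome_method if is_binary else continuous_outcome_method
        return chosen(y, a, covariates)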

14. The previous year's data challenge talked about computation time. Will this be a criterion this year? If yes, how is it going to be evaluated? [posted January 26, 2019]

We are going to collect descriptive information about the methods. We will ask for an estimated computation time and report it as part of the performance metrics. The aim is to facilitate dissemination of information within the research community.

15. I understand the estimation datasets, but I'm confused about the corresponding files that contain the true counterfactuals. Do we use these to see if our CIs are covering the truth (so we know how good we are before we even submit), or are they for another reason? [posted February 11, 2019]

We knew we wouldn’t have the competition datasets available until mid-January, but wanted to let people get a jump start. We provided test datasets at the end of December to illustrate the format the datasets would be in, and what we meant by “low” vs. “high” dimensional. The counterfactual datasets came from several different DGPs. They were provided because the best one can do on a single dataset is the sample average treatment effect (SATE), not the population average (PATE). The values in the counterfactual files let you calculate SATE, and also let you compare predicted outcomes for individual i with EY1_i and EY0_i as well as the observed Y. All this was just to help participants debug, and perhaps gain a little insight into performance.
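
For example, a debugging check along those lines might look like the Python sketch below; the file name, column names, and numbers are placeholders, so consult the actual downloads for the real ones:

    import pandas as pd

    cf = pd.read_csv("testdataset1_cf.csv")          # hypothetical counterfactual file with EY1 / EY0 columns

    sate = (cf["EY1"] - cf["EY0"]).mean()            # sample average treatment effect for this dataset

    # Suppose your method produced this point estimate and 95% CI on the matching estimation dataset:
    est, ci_lower, ci_upper = 0.45, 0.30, 0.60       # placeholder numbers
    print(f"SATE = {sate:.3f}  estimate = {est:.3f}  CI covers SATE: {ci_lower <= sate <= ci_upper}")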

16. Will the true PATE for the challenge datasets be disclosed after the competition deadline? I believe that these would be very useful for the development of future methods to tackle this problem. [posted March 5, 2019]

Yes. We will provide the true values for each of the datasets after the Conference is held in May. (We will wait until then in case we have any problems analyzing the results files we receive and have to request new files.)

17. The datasets were generated by drawing 100 samples from each of 32 data generating processes (each of which induces a distribution). It might be possible to combine these samples in order to train models that require large quantities of data (e.g., neural networks). Hence, our question is: in the context of this challenge, would leveraging data from several datasets in order to improve method accuracy across all datasets (i.e., transfer learning) be considered “cheating”? [posted March 5, 2019]

It's definitely not what we intended. We are trying to learn something about finite sample performance, so this violates the spirit of the Challenge. A direct comparison with methods that analyze each dataset separately wouldn't make any sense, since the challenges at sample size n are quite different from those faced at sample size 100n.

On the other hand, maybe we can learn something novel from the findings. Anybody who takes advantage of the fact that there are 100 replicates per DGP should let us know when they submit results, either directly on the submission form or by email. We will compare results from that set of approaches with each other, but consider them separately from methods that respected the constraints we intended to impose.

As long as everything is transparent, there shouldn't be a problem. It's also ok to submit two different sets of results -- one where the team analyzes the datasets individually (as we intended), and another that exploits the power of combining datasets you suspect arise from the same DGP.

Have another question? Contact us at sgruber[at]putnamds[dot]com