Python package for the paper
A Semiparametric Bayesian Approach to Dropout in Longitudinal Studies with Auxiliary Covariates
by Tianjian Zhou, Michael J. Daniels and Peter Mueller
[Download package] [Manuscript]
We develop a semiparametric Bayesian approach to missing outcome data in longitudinal studies in the presence of auxiliary covariates. We consider a joint model for the full data response, missingness and auxiliary covariates. We include auxiliary covariates to "move" the missingness "closer" to missing at random (MAR). In particular, we specify a semiparametric Bayesian model for the observed data via Gaussian process priors and Bayesian additive regression trees. These model specifications allow us to capture non-linear and non-additive effects, in contrast to existing parametric methods. We then separately specify the conditional distribution of the missing data response given the observed data response, missingness and auxiliary covariates (i.e. the extrapolation distribution) using identifying restrictions. We introduce meaningful sensitivity parameters that allow for a simple sensitivity analysis. Informative priors on those sensitivity parameters can be elicited from subject-matter experts. We use Monte Carlo integration to compute the full data estimands. Performance of our approach is assessed using simulated datasets. Our methodology is motivated by, and applied to, data from a clinical trial on treatments for schizophrenia.
This package contains the Python and R code files to run the semiparametric Bayesian approach to missing data described in the paper.
Explanation of files:
- fn_bspmis.py: Contains all the Python functions needed for posterior inference (through MCMC), G-computation and so on.
- fn_bspmis_bart.R: Contains the R function to run BART (Chipman and McCulloch 2016).
- bspmis.py: The main executable python code to conduct posterior inference (through MCMC) and G-computation.
- bspmis.R: An R interface, for those who have no experience with Python. (Need R getopt package.) Exactly the same way to execute using Rscript bspmis.R. (See details below.)
- ./data/schizophrenia.csv: A simulated dataset under simulation scenario 2.
The following directories need to exist under the current directory:
- ./data: which is the path to the data file xxx.csv
- ./MCMC: which saves the MCMC samples
- ./Gcomp: which saves the replicated responses in the G-computation
- ./results: which is the path to the result file xxx.csv
- ./tmp: saves some temporary files. You can remove the files after you have finished running the code.
- There is one more directory, ./params, which contains the simulation truths described in the paper.
In order to run bspmis:
- Extract the package and go to terminal.
- Make sure you have Python 2.7 and R, common Python packages such as "numpy", "scipy", "pandas", "matplotlib" are required, R packages "BayesTree" and "RcppCNPy" are required.
- The format is
$ python bspmis.py -t -r <Arg1> -i <Arg2> -o <output file name> -e <identifying restriction name>
- -t : If -t is specified, bspmis is run under test mode, i.e. less iterations and less replicated subjects, just to test if it works.
- Arg1: One of "MCMC" or "Gcomp" or "summary". The function to be executed
- Arg2: Data file name. For example, "./data/schizophrenia.csv". Format of the data file: a csv file. The columns with "v" are the auxiliary covariates. The column "s" is the dropout time. The columns with "y" are the responses. In particular "yj" is the response at time j. The column names must have the corresponding character "v", "s" and "y". The responses "yj" need to be ordered, the responses at an earlier time point must appear before the responses at an later time point. The package currently only supports continuous auxiliary covariates. Discrete ones can be easily incorporated if you understand the functions.
- Arg3: Output file name. For example, "./results/result.csv". Format of the result file: It contains the estimated mean responses at every time point y_j, j = 1, ..., J and the treatment effect r = y_J - y_1, as well as the 95% CIs.
- Arg4: Identifying restrictions. Currently supports "MAR" i.e. missing at random, "NFD" i.e. non-future dependence (which has the Unif(0, 1) prior for the sensitivity parameter tau_tilde, can be easily changed), and "ObsOnly", i.e. only drawing the observed responses up to dropout time s.
An example
- First run the following, get the MCMC samples (saved under ./MCMC)
$ python bspmis.py -t -r MCMC -i ./data/schizophrenia.csv
or executing the R script by replacing "python bspmis.py" with "Rscript bspmis.R", as below
$ Rscript bspmis.R -t -r MCMC -i ./data/schizophrenia.csv
- Then run the following, get the replicated subjects (saved under ./Gcomp)
$ python bspmis.py -t -r Gcomp -i ./data/schizophrenia.csv -e MAR
(or Rscript bspmis.R ... )
- Finally run the following, get the summaries
$ python bspmis.py -t -r summary -i ./data/schizophrenia.csv -o ./results/result.csv
(or Rscript bspmis.R ... )