TH19: Foundations and Practical Applications in Causal Decision Making
Wednesday, February 21, 2:00 pm - 6:00 pm
AAAI 2024 - Vancouver, Canada
Overview
To make effective decisions, it is important to have a thorough understanding of the causal connections among actions, environments, and outcomes. This tutorial examines three crucial aspects of decision making through a causal lens: 1) discovering causal relationships via causal structure learning, 2) quantifying the impacts of these relationships via causal effect learning, and 3) applying the knowledge gained from the first two steps to support decision making via causal policy learning. By consolidating the various methods in this area into a Python-based collection, the tutorial offers a comprehensive methodology and a practical implementation framework, and provides a unified view of topics including causal inference, causal discovery, randomized experiments, dynamic treatment regimes, bandits, and reinforcement learning. The tutorial is based on an online book with an accompanying Python package (in progress; collaboration is welcome).
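As a concrete illustration of how the three steps fit together, the following is a minimal, self-contained Python sketch on simulated data. It is purely illustrative and does not use the tutorial's package or its API: the structure-learning step is only indicated by a comment (in practice one would fit a DAG with, e.g., a PC- or NOTEARS-style learner), the effect-learning step uses a simple T-learner built on scikit-learn random forests, and the policy-learning step treats whenever the estimated conditional effect is positive.

```python
# Hypothetical end-to-end sketch of the three steps: structure learning,
# effect learning, and policy learning. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated data: covariates X, binary treatment A, outcome R.
n, p = 2000, 5
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, size=n)                      # randomized treatment assignment
tau = X[:, 0] - 0.5 * X[:, 1]                         # true heterogeneous treatment effect
R = X[:, 2] + A * tau + rng.normal(scale=0.5, size=n)

# Step 1 -- causal structure learning: in practice, learn a DAG over (X, A, R)
# (e.g., with a PC- or NOTEARS-style method) to decide which variables to adjust for.
# Here we simply adjust for all covariates in X.

# Step 2 -- causal effect learning: a T-learner estimate of the CATE tau(x).
mu0 = RandomForestRegressor(random_state=0).fit(X[A == 0], R[A == 0])
mu1 = RandomForestRegressor(random_state=0).fit(X[A == 1], R[A == 1])
cate_hat = mu1.predict(X) - mu0.predict(X)

# Step 3 -- causal policy learning: treat whenever the estimated effect is positive.
policy = (cate_hat > 0).astype(int)
est_value = np.mean(np.where(policy == 1, mu1.predict(X), mu0.predict(X)))
print(f"Estimated value of the learned policy: {est_value:.3f}")
```

More refined versions of each step (e.g., doubly robust effect estimators, Q-learning for multi-stage decisions, and bandit algorithms for online settings) are the subject of the sessions and references below.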
Introduction and Overview on Causal Decision Making (Rui Song): 2:00 pm - 2:50 pm
Break: 2:50 pm - 2:55 pm
Causal Structure Learning (Hengrui Cai): 2:55 pm - 3:30 pm
Break: 3:30 pm - 4:00 pm
Causal Effect Learning (Yang Xu): 4:00 pm - 4:35 pm
Causal Policy Learning - offline (Runzhe Wan): 4:35 pm - 5:10 pm
Causal Policy Learning - online (Lin Ge): 5:10 pm - 5:45 pm
Floor Discussion: 5:45 pm - 6:00 pm
References
Causal Structure Learning
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96-146.
Spirtes, P., Glymour, C., Scheines, R., Kauffman, S., Aimale, V., & Wimberly, F. (2000). Constructing Bayesian network models of gene expression networks from microarray data.
Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., & Jordan, M. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10).
Zheng, X., Aragam, B., Ravikumar, P. K., & Xing, E. P. (2018). Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31.
Yu, Y., Chen, J., Gao, T., & Yu, M. (2019, May). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning (pp. 7154-7163). PMLR.
Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 1-9.
Cai, H., Song, R., & Lu, W. (2020, October). ANOCE: Analysis of causal effects with multiple mediators via constrained structural learning. In International Conference on Learning Representations.
Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric Society, 424-438.
Entner, D., & Hoyer, P. O. (2010). On causal discovery from time series data using FCI. Probabilistic graphical models, 121-128.
Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., & Sejdinovic, D. (2019). Detecting and quantifying causal associations in large nonlinear time series datasets. Science advances, 5(11), eaau4996.
Hyvärinen, A., Zhang, K., Shimizu, S., & Hoyer, P. O. (2010). Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11(5).
Peters, J., Janzing, D., & Schölkopf, B. (2013). Causal inference on time series using restricted structural equation models. Advances in neural information processing systems, 26.
Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., & Aragam, B. (2020, June). Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics (pp. 1595-1605). PMLR.
Sun, X., Schulte, O., Liu, G., & Poupart, P. (2021). NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge. arXiv preprint arXiv:2109.04286.
Causal Effect Learning
Tsiatis, A. A., Davidian, M., Holloway, S. T., & Laber, E. B. (2019). Dynamic treatment regimes: Statistical methods for precision medicine. CRC press.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
Zhang, B., Tsiatis, A. A., Laber, E. B., & Davidian, M. (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100(3), 681-694.
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156-4165.
Curth, A., & van der Schaar, M. (2021, March). Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In International Conference on Artificial Intelligence and Statistics (pp. 1810-1818). PMLR.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178.
Shi, C., Blei, D., & Veitch, V. (2019). Adapting neural networks for the estimation of treatment effects. Advances in neural information processing systems, 32.
Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.
Hicks, R., & Tingley, D. (2011). Causal mediation analysis. The Stata Journal, 11(4), 605-619.
Hong, G. (2010). Ratio of mediator probability weighting for estimating natural direct and indirect effects. In Proceedings of the American Statistical Association, biometrics section (pp. 2401-2415).
Imai, K., Keele, L., & Tingley, D. (2010). A general approach to causal mediation analysis. Psychological methods, 15(4), 309.
Pearl, J. (2022). Direct and indirect effects. In Probabilistic and causal inference: the works of Judea Pearl (pp. 373-392).
Tchetgen, E. J. T., & Shpitser, I. (2012). Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Annals of statistics, 40(3), 1816.
Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends® in Econometrics, 4(3), 165-224.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
Li, K. T. (2020). Statistical inference for average treatment effects estimated by synthetic control methods. Journal of the American Statistical Association, 115(532), 2068-2083.
Nie, X., Lu, C., & Wager, S. (2019). Nonparametric heterogeneous treatment effect estimation in repeated cross sectional designs. arXiv preprint arXiv:1905.11622.
Viviano, D., & Bradic, J. (2023). Synthetic learner: model-free inference on treatments over time. Journal of Econometrics, 234(2), 691-713.
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118.
Fernández-Loría, C., & Provost, F. (2022). Causal decision making and causal effect estimation are not the same… and why it matters. INFORMS Journal on Data Science, 1(1), 4-16.
Causal Policy Learning - Offline
Kallus, N., & Zhou, A. (2021). Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5), 2870–2890.
Fu, Z., Qi, Z., Wang, Z., Yang, Z., Xu, Y., & Kosorok, M. R. (2022). Offline reinforcement learning with instrumental variables in confounded Markov decision processes. arXiv preprint arXiv:2209.08666.
Xu, L., Kanagawa, H., & Gretton, A. (2021). Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34, 26264–26275.
Uehara, M., Shi, C., & Kallus, N. (2022). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355.
Prudencio, R. F., Maximo, M. R., & Colombini, E. L. (2023). A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems.
Liao, P., Qi, Z., Wan, R., Klasnja, P., & Murphy, S. A. (2022). Batch policy learning in average reward Markov decision processes. Annals of Statistics, 50(6), 3364.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073-1097.
Song, R., Kosorok, M., Zeng, D., Zhao, Y., Laber, E., & Yuan, M. (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat, 4(1), 59–68.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 65(2), 331–366. doi:10.1111/1467-9868.00389
Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940.
Wang, L., Zhou, Y., Song, R., & Sherwood, B. (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523), 1243–1254.
Cai, H., Shi, C., Song, R., & Lu, W. (2021). Deep jump learning for off-policy evaluation in continuous treatment settings. Advances in Neural Information Processing Systems, 34, 15285–15300.
Jiang, R., Lu, W., Song, R., Hudgens, M. G., & Naprvavnik, S. (2017). Doubly robust estimation of optimal treatment regimes for survival data—with application to an HIV/AIDS study. The Annals of Applied Statistics, 11(3), 1763.
Jeunen, O., & Goethals, B. (2021). Pessimistic reward models for off-policy learning in recommendation. Proceedings of the 15th ACM Conference on Recommender Systems, 63–74.
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., & Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34, 11702–11716.
Jin, Y., Yang, Z., & Wang, Z. (2021). Is pessimism provably efficient for offline RL? International Conference on Machine Learning, 5084–5096. PMLR.
Kumar, A., Fu, J., Soh, M., Tucker, G., & Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32.
Fujimoto, S., Meger, D., & Precup, D. (2019). Off-policy deep reinforcement learning without exploration. International Conference on Machine Learning, 2052–2062. PMLR.
Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., … Riedmiller, M. (2020). Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396.
Wu, Y., Tucker, G., & Nachum, O. (2019). Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., … Picard, R. (2019). Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
Shi, C., Fan, A., Song, R., & Lu, W. (2018). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3), 925.
Shi, C., Song, R., Lu, W., & Fu, B. (2018). Maximin projection learning for optimal treatment decision with heterogeneous individualized treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4), 681–702.
Cai, H., Song, R., & Lu, W. (2021). GEAR: On optimal decision making with auxiliary data. Stat, 10(1), e399.
Fan, A., Song, R., & Lu, W. (2017). Change-plane analysis for subgroup detection and sample size calculation. Journal of the American Statistical Association, 112(518), 769–778.
Kang, S., Lu, W., & Song, R. (2017). Subgroup detection and sample size calculation with proportional hazards regression for survival data. Statistics in Medicine, 36(29), 4646–4659.
Fan, C., Lu, W., Song, R., & Zhou, Y. (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(5), 1565–1582.
Jiang, B., Song, R., Li, J., & Zeng, D. (2019). Entropy learning for dynamic treatment regimes. Statistica Sinica, 29(4), 1633.
Kitagawa, T., & Tetenov, A. (2018). Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2), 591–616.
Zhu, W., Zeng, D., & Song, R. (2019). Proper inference for value function in high-dimensional Q-learning for dynamic treatment regimes. Journal of the American Statistical Association, 114(527), 1404–1417.
Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31.
Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., & Schuurmans, D. (2020). CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33, 9398–9411.
Jiang, N., & Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. International Conference on Machine Learning, 652–661. PMLR.
Kallus, N., & Uehara, M. (2022). Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research.
Feng, Y., Li, L., & Liu, Q. (2019). A kernel loss for solving the bellman equation. Advances in Neural Information Processing Systems, 32.
Zhang, J., & Bareinboim, E. (2019). Near-optimal reinforcement learning in dynamic treatment regimes. Advances in Neural Information Processing Systems, 32.
Zhou, X., & Kosorok, M. R. (2017). Augmented outcome-weighted learning for optimal treatment regimes. arXiv preprint arXiv:1711.10654.
Zhao, Y.-Q., & Laber, E. B. (2014). Estimation of optimal dynamic treatment regimes. Clinical Trials, 11(4), 400–407.
Dong, J., Mo, W., Qi, Z., Shi, C., Fang, E. X., & Tarokh, V. (2023). PASTA: Pessimistic Assortment Optimization. arXiv preprint arXiv:2302.03821.
Cief, M., Kveton, B., & Kompan, M. (2022). Pessimistic Off-Policy Optimization for Learning to Rank. arXiv preprint arXiv:2206.02593.
Shi, L., Li, G., Wei, Y., Chen, Y., & Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. International Conference on Machine Learning, 19967–20025. PMLR.
Shi, C., Zhu, J., Ye, S., Luo, S., Zhu, H., & Song, R. (2022). Off-policy confidence interval estimation with confounded Markov decision process. Journal of the American Statistical Association, 1–12.
Xu, Y., Zhu, J., Shi, C., Luo, S., & Song, R. (2023). An instrumental variable approach to confounded off-policy evaluation. International Conference on Machine Learning, 38848–38880. PMLR.
Wang, J., Qi, Z., & Shi, C. (2022). Blessing from experts: Super reinforcement learning in confounded environments. arXiv preprint arXiv:2209.15448.
Causal Policy Learning - Online
Shi, C., Zhang, S., Lu, W., & Song, R. (2022). Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 765-793.
Ye, S., Cai, H., & Song, R. (2023). Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. Journal of the American Statistical Association, (just-accepted), 1-20.
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (pp. 661-670).
Agrawal, S., & Goyal, N. (2013, May). Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning (pp. 127-135). PMLR.
Li, L., Lu, Y., & Zhou, D. (2017, July). Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning (pp. 2071-2080). PMLR.
Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., & Boutilier, C. (2020, June). Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics (pp. 2066-2076). PMLR.
Hazan, E., & Megiddo, N. (2007). Online learning with prior knowledge. In Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA, June 13-15, 2007, Proceedings (pp. 499-513). Springer Berlin Heidelberg.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1), 48-77.
Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20.
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. (2014, June). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (pp. 1638-1646). PMLR.
Slivkins, A. (2011, December). Contextual bandits with similarity information. In Proceedings of the 24th annual Conference On Learning Theory (pp. 679-702). JMLR Workshop and Conference Proceedings.
Cheung, W. C., Tan, V., & Zhong, Z. (2019, April). A Thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 438-447). PMLR.
Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., & Kveton, B. (2016). Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359.
Wang, S., & Chen, W. (2018, July). Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning (pp. 5114-5122). PMLR.
Wen, Z., Kveton, B., & Ashkan, A. (2015, June). Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning (pp. 1113-1122). PMLR.
Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2017, June). Thompson sampling for the MNL-bandit. In Conference on Learning Theory (pp. 76-78). PMLR.
Oh, M. H., & Iyengar, G. (2019). Thompson sampling for multinomial logit contextual bandits. Advances in Neural Information Processing Systems, 32.
Wan, R., Ge, L., & Song, R. (2023, April). Towards scalable and robust structured bandits: A meta-learning framework. In International Conference on Artificial Intelligence and Statistics (pp. 1144-1173). PMLR.
Cesa-Bianchi, N. O., Gentile, C., Mansour, Y., & Minora, A. (2016, June). Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory (pp. 605-622). PMLR.
Wang, P. A., Proutiere, A., Ariu, K., Jedra, Y., & Russo, A. (2020, June). Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics (pp. 4120-4129). PMLR.
Dubey, A. (2020, November). Cooperative multi-agent bandits with heavy tails. In International conference on machine learning (pp. 2730-2739). PMLR.
Hsu, C. W., Kveton, B., Meshi, O., Mladenov, M., & Szepesvari, C. (2019). Empirical Bayes regret minimization. arXiv preprint arXiv:1904.02664.
Boutilier, C., Hsu, C. W., Kveton, B., Mladenov, M., Szepesvari, C., & Zaheer, M. (2020). Differentiable meta-learning of bandit policies. Advances in Neural Information Processing Systems, 33, 2122-2134.
Kveton, B., Konobeev, M., Zaheer, M., Hsu, C. W., Mladenov, M., Boutilier, C., & Szepesvari, C. (2021, July). Meta-Thompson sampling. In International Conference on Machine Learning (pp. 5884-5893). PMLR.
Lazaric, A., & Brunskill, E. (2013). Sequential transfer in multi-armed bandit with finite set of models. Advances in Neural Information Processing Systems, 26.
Vaswani, S., Schmidt, M., & Lakshmanan, L. (2017, April). Horde of bandits using Gaussian Markov random fields. In Artificial Intelligence and Statistics (pp. 690-699). PMLR.
Hong, J., Kveton, B., Zaheer, M., Chow, Y., Ahmed, A., & Boutilier, C. (2020). Latent bandits revisited. Advances in Neural Information Processing Systems, 33, 13423-13433.
Wan, R., Ge, L., & Song, R. (2021). Metadata-based multi-task bandits with Bayesian hierarchical models. Advances in Neural Information Processing Systems, 34, 29655-29668.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
Spaan, M. T. (2012). Partially observable Markov decision processes. In Reinforcement learning: State-of-the-art (pp. 387-414). Berlin, Heidelberg: Springer Berlin Heidelberg.
Meng, L., Gorbet, R., & Kulić, D. (2021, September). Memory-based deep reinforcement learning for POMDPs. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 5619-5626). IEEE.
Zhu, P., Li, X., Poupart, P., & Miao, G. (2017). On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1704.07978.
Dimakopoulou, M., Zhou, Z., Athey, S., & Imbens, G. (2019, July). Balanced linear contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 3445-3453).
Kim, W., Kim, G. S., & Paik, M. C. (2021). Doubly robust Thompson sampling with linear payoffs. Advances in Neural Information Processing Systems, 34, 15830-15840.
Bareinboim, E., Forney, A., & Pearl, J. (2015). Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28.
Forney, A., Pearl, J., & Bareinboim, E. (2017, July). Counterfactual data-fusion for online reinforcement learners. In International Conference on Machine Learning (pp. 1156-1164). PMLR.
Lu, Y., Meisami, A., Tewari, A., & Yan, W. (2020, August). Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence (pp. 141-150). PMLR.
Nair, V., Patil, V., & Sinha, G. (2021, March). Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics (pp. 2017-2025). PMLR.
Zhang, J., & Bareinboim, E. (2017, May). Transfer learning in multi-armed bandit: a causal approach. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (pp. 1778-1780).
Huang, W., Zhang, L., & Wu, X. (2022, June). Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 6, pp. 6952-6959).
Gonzalez-Soto, M., Sucar, L. E., & Escalante, H. J. (2018). Playing against nature: causal discovery for decision making under uncertainty. arXiv preprint arXiv:1807.01268.
De Kroon, A., Mooij, J., & Belgrave, D. (2022, June). Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning (pp. 407-427). PMLR.
Lu, Y., Meisami, A., & Tewari, A. (2021). Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34, 24817-24828.
Zhu, S., Ng, I., & Chen, Z. (2019). Causal discovery with reinforcement learning. arXiv preprint arXiv:1906.04477.
Sauter, A. W., Botteghi, N., Acar, E., & Plaat, A. (2024). CORE: Towards Scalable and Efficient Causal Discovery with Reinforcement Learning. arXiv preprint arXiv:2401.16974.
Li, Y., Xie, H., Lin, Y., & Lui, J. C. (2021, April). Unifying offline causal inference and online bandit learning for data driven decision. In Proceedings of the Web Conference 2021 (pp. 2291-2303).
Sawant, N., Namballa, C. B., Sadagopan, N., & Nassif, H. (2018). Contextual multi-armed bandits for causal marketing. arXiv preprint arXiv:1810.01859.
Carranza, A. G., Krishnamurthy, S. K., & Athey, S. (2023, April). Flexible and efficient contextual bandits with heterogeneous treatment effect oracles. In International Conference on Artificial Intelligence and Statistics (pp. 7190-7212). PMLR.
Simchi-Levi, D., & Wang, C. (2023, April). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In International Conference on Artificial Intelligence and Statistics (pp. 3086-3097). PMLR.
Soare, M., Lazaric, A., & Munos, R. (2014). Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27.