Nathan Justin, Sina Aghaei, Andrés Gómez, Phebe Vayanos
Minor Revision submitted at Operations Research, Short Version in AAAI-22 Workshop on Adversarial Machine Learning and Beyond
Preprints: arXiv, Optimization Online
We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high-stakes settings such as public health and social work, where data is often collected using self-reported surveys that are highly sensitive to, e.g., the framing of the questions, the time and place at which the survey is conducted, and the level of comfort the interviewee has in sharing information with the interviewer. We propose a method for learning optimal robust classification trees based on mixed-integer robust optimization technology. In particular, we demonstrate that the problem of learning an optimal robust tree can be cast as a single-stage mixed-integer robust optimization problem with a highly nonlinear and discontinuous objective. We reformulate this problem equivalently as a two-stage linear robust optimization problem, for which we devise a tailored solution procedure based on constraint generation. We evaluate the performance of our approach on numerous publicly available datasets and compare it to that of a regularized, non-robust optimal tree. Across several datasets and distribution shifts, our robust solution improves worst-case accuracy by up to 12.48% and average-case accuracy by up to 4.85% relative to the non-robust one.
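The constraint-generation idea mentioned in the abstract can be illustrated on a toy min-max problem. The sketch below is a hypothetical, self-contained example (not the paper's mixed-integer formulation): it alternates between a master problem over the scenarios seen so far and a separation step that searches for a violated worst-case scenario, stopping when none exists.

```python
# Toy illustration of constraint generation for a robust (min-max) problem:
# minimize over decisions the worst-case loss over scenarios, materializing
# scenarios (constraints) only as they become binding. This is a sketch over
# finite sets solved by enumeration, not the paper's MIO formulation.

def constraint_generation(decisions, scenarios, loss, tol=1e-9):
    """Solve min_d max_s loss(d, s) by iteratively adding worst-case scenarios."""
    active = [scenarios[0]]  # start with a single scenario
    while True:
        # Master problem: best decision against the scenarios generated so far.
        d_star = min(decisions, key=lambda d: max(loss(d, s) for s in active))
        master_val = max(loss(d_star, s) for s in active)
        # Separation problem: worst scenario for the current decision.
        s_worst = max(scenarios, key=lambda s: loss(d_star, s))
        if loss(d_star, s_worst) <= master_val + tol:
            return d_star, master_val  # no violated scenario: robust optimum
        active.append(s_worst)  # add the violated scenario and re-solve

# Tiny example: decisions and scenarios are scalars, loss is squared distance.
d_opt, val = constraint_generation(
    decisions=[0.0, 0.5, 1.0],
    scenarios=[0.0, 1.0],
    loss=lambda d, s: (d - s) ** 2,
)
```

In practice the master problem is a mixed-integer program and the separation step exploits problem structure; the loop above only conveys the overall scheme.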
Short talk presented at ShowCAIS in April 2023.
Longer talk presented at the Machine Learning NeEDS Mathematical Optimization YOUNG Seminar Series in March 2024.
Nathan Justin, Qingshi Sun, Andrés Gómez, Phebe Vayanos
Forthcoming at Tutorials in Operations Research
Preprints: arXiv, Optimization Online
In the last few decades, machine learning (ML) has achieved significant success across domains ranging from healthcare, sustainability, and the social sciences to criminal justice and finance. But its deployment in increasingly sophisticated, critical, and sensitive areas affecting individuals, the groups they belong to, and society as a whole raises pressing concerns around fairness, transparency, and robustness, among others. As the complexity and scale of ML systems and of the settings in which they are deployed grow, so does the need for responsible ML methods that address these challenges while providing guaranteed performance in deployment. Mixed-integer optimization (MIO) offers a powerful framework for embedding responsible ML considerations directly into the learning process while maintaining performance. For example, it enables learning of inherently transparent models that can conveniently incorporate fairness or other domain-specific constraints. This tutorial paper provides an accessible and comprehensive introduction to this topic, discussing both theoretical and practical aspects. It outlines some of the core principles of responsible ML, their importance in applications, and the practical utility of MIO for building ML models that align with these principles. Through examples and mathematical formulations, it illustrates practical strategies and available tools for efficiently solving MIO problems for responsible ML. It concludes with a discussion of current limitations and open research questions, providing suggestions for future work.
Nathan Justin, Daniël Vos, Phebe Vayanos
In preparation for submission to Management Science
In domains such as personalized medicine, historical data is used to learn what treatments to prescribe to maximize positive outcomes. Previous studies have proposed methods for creating prescriptive trees: human-interpretable diagrams that indicate what type of treatment an individual should receive based on their measurements. However, a remaining problem is that such models perform worse over time when there is a shift in the data collection process or when data from a different source is used at training and at prediction time. To address this problem, we propose a method that accounts for data uncertainty by optimizing distributionally robust prescriptive trees. We formulate a linear-time algorithm to find the worst-case distribution shift within a given Wasserstein distance around the dataset and use it as a subroutine within the main problem. Our algorithm does not depend on any specific causal effect estimator and can therefore be applied in various contexts.
Patrick Vossler, Sina Aghaei, Nathan Justin, Nathanael Jo, Andrés Gómez, Phebe Vayanos
Revise and Resubmit at Journal of Machine Learning Research
Preprints: arXiv, Optimization Online
ODTLearn is an open-source Python package that provides methods for learning optimal decision trees for high-stakes predictive and prescriptive tasks based on a mixed-integer optimization (MIO) framework. The current version of the package provides implementations for learning optimal classification trees, optimal fair classification trees, optimal classification trees robust to distribution shifts, and optimal prescriptive trees from observational data. We have designed the package to be easy to maintain and extend as new optimal decision tree problem classes, reformulation strategies, and solution algorithms are introduced. To this end, the package follows object-oriented design principles and supports both commercial (Gurobi) and open-source (COIN-OR branch-and-cut) solvers.
The package documentation and an extensive user guide can be found at https://d3m-research-group.github.io/odtlearn/.
Additionally, users can view the package source code and submit feature requests and bug reports by visiting https://github.com/D3M-Research-Group/odtlearn.
Qingshi Sun, Nathan Justin, Andrés Gómez, Phebe Vayanos
Accepted at AISTATS
Preprints: arXiv, Optimization Online
Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains due to their simplicity and interpretability. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.
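A standard building block behind robust logistic regression, which the sketch below illustrates, is that the worst-case logistic loss under a norm-bounded perturbation of the features has a closed form: the nominal margin is shifted by the perturbation budget times the dual norm of the weight vector. This is a generic, self-contained illustration with a uniform l-infinity perturbation ball, not the paper's feature-dependent Wasserstein ambiguity set.

```python
import math

# Worst-case logistic loss under a feature perturbation ||delta||_inf <= eps:
# since the logistic loss is decreasing in the margin y * w.(x + delta), the
# adversary minimizes the margin, shifting it by eps * ||w||_1 (the l_1 norm
# is the dual of the l_inf norm). Illustrative sketch only.

def logistic_loss(margin):
    # log(1 + exp(-margin)), written to stay numerically stable for large |margin|
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

def robust_logistic_loss(w, x, y, eps):
    """Worst-case logistic loss over feature shifts with ||delta||_inf <= eps."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    dual_norm = sum(abs(wi) for wi in w)  # l_1 norm, dual of l_inf
    return logistic_loss(margin - eps * dual_norm)

w, x, y = [1.0, -2.0], [0.5, 0.25], 1.0
nominal = robust_logistic_loss(w, x, y, eps=0.0)  # ordinary logistic loss
robust = robust_logistic_loss(w, x, y, eps=0.1)   # adversarially shifted loss
```

Feature-dependent shift likelihoods, as studied in the paper, would replace the single budget `eps` with per-feature weights inside the dual norm; the uniform ball above is the simplest special case.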