Last updated: Sep 2024
Validity in Machine Learning
Machine learning (ML) is a subfield of artificial intelligence that focuses on learning from data. Over the last few decades it has become one of the most rapidly advancing fields, catalyzed by the introduction of statistical machine learning in the 1990s. Research in ML can be broadly categorized into two types: research focused on utility and research focused on validity.
For instance, improvements in regression models, image recognition, and the development of generative models such as large language models fall under the former category. How these lines of research translate into applications is relatively clear. For example, applying image recognition technologies in the medical field would be promising, provided the accuracy of these models reaches a practical level.
In contrast, my research focuses on theoretical machine learning that falls in the latter category: validity. The study of validity is particularly important in three domains. The first is decision-making, where we aim to make provably good decisions. The second is fairness: decisions related to humans, such as university admissions, inherently require fairness and equity. The third is social science, where proposed hypotheses must be reproducible.
My passion for validity in ML stems from a desire to understand fundamental methods. During my Ph.D., I was fascinated by simple ideas such as the upper confidence bound (UCB) and posterior sampling for sequential decision-making. After completing my Ph.D., I expanded my research interests to broader topics such as Monte Carlo tree search, recommendation, financial methods, and econometrics.
The technical tools I use include information theory, which provides a way to measure the probability of events; optimization, which offers a framework for defining goals and complexity; and decision theory, which offers a philosophical foundation for rational behavior under uncertainty.
The multi-armed bandit (MAB) problem is a simple model of sequential decision-making that consists of multiple rounds. It is the most basic version of reinforcement learning that models the tradeoff between exploration (gathering information across all actions) and exploitation (selecting the seemingly most rewarding arms based on current knowledge). The objective in MAB is online: the decision-maker attempts to accumulate as much reward as possible during the run itself, on average. Notable MAB algorithms include the upper confidence bound (UCB, Lai and Robbins 1985), Thompson sampling (TS, Thompson 1933), and minimum empirical divergence (MED, Honda and Takemura 2010).
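To make the exploration-exploitation tradeoff concrete, here is a minimal Python sketch of the textbook UCB1 index policy on Bernoulli arms; the arm means, horizon, and helper names are illustrative placeholders, not taken from any of my papers.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 sketch: pull the arm with the highest optimistic index."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls of each arm
    sums = [0.0] * k          # cumulative reward of each arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # initialization: play each arm once
        else:
            # optimism: empirical mean plus a confidence bonus that shrinks with pulls
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0   # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, counts

# toy run with three Bernoulli arms; the best arm (mean 0.6) ends up pulled most often
print(ucb1(means=[0.4, 0.5, 0.6], horizon=10_000))
```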
My primary work until 2019 focused on the characterization of MAB algorithms for various applications, such as online dynamic pricing, slate recommendation for online advertisements, and ranking optimization via comparison for search engines. My key achievement was the characterization of the optimal balance between exploration and exploitation as a unified optimization problem, reducing the complexity of online learning to a static optimization problem within the framework of linear semi-infinite programming. Based on this characterization, I developed algorithms tailored to each application.
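To convey the flavor of this reduction, a schematic version of such a characterization (in the classical Graves-Lai style, not the exact program from any one of my papers) is

\[
\min_{c \ge 0} \; \sum_{a:\,\Delta_a > 0} c_a \Delta_a
\qquad \text{s.t.} \qquad
\sum_{a} c_a \, \mathrm{KL}(\nu_a, \nu'_a) \ge 1
\quad \text{for every alternative model } \nu' \text{ in which a different arm is optimal},
\]

where $c_a$ is the asymptotic number of pulls (per $\log T$) allocated to arm $a$, $\Delta_a$ its suboptimality gap, and $\mathrm{KL}$ the Kullback-Leibler divergence. The infinitely many linear constraints, one per alternative model, are what place the problem in the class of linear semi-infinite programs.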
My primary work from 2021 to the present has been the characterization of the fixed-budget best-arm identification (BAI) problem. Compared to MAB, which balances exploration and exploitation, BAI focuses on pure exploration: What strategy maximizes the confidence in identifying the best option? In BAI, there are two main settings: the fixed-confidence (FC) setting stops sampling once the best arm is identified at a prescribed confidence level, whereas the fixed-budget (FB) setting maximizes confidence given a fixed number of samples. Fixed-duration A/B/C testing as well as Monte Carlo tree search belong to the latter setting.
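For concreteness, a standard fixed-budget baseline is the successive-rejects scheme of Audibert and Bubeck (2010), sketched below in Python; this is a textbook baseline rather than the algorithm proposed in my work, and the arm means and budget are placeholders.

```python
import math
import random

def successive_rejects(means, budget, seed=0):
    """Fixed-budget BAI sketch: eliminate the empirically worst arm phase by phase."""
    rng = random.Random(seed)
    k = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, k + 1))   # the \bar{log}(k) constant
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    n_prev = 0
    for phase in range(1, k):
        # target number of pulls per surviving arm by the end of this phase
        n_phase = math.ceil((budget - k) / (log_bar * (k + 1 - phase)))
        for arm in active:
            for _ in range(n_phase - n_prev):                # top up each surviving arm
                reward = 1.0 if rng.random() < means[arm] else 0.0
                counts[arm] += 1
                sums[arm] += reward
        n_prev = n_phase
        # reject the arm with the lowest empirical mean
        active.remove(min(active, key=lambda a: sums[a] / counts[a]))
    return active[0]   # recommended best arm

print(successive_rejects(means=[0.4, 0.5, 0.6], budget=5_000))
```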
Our seminal paper [Komiyama, Tsuchiya, Honda NeurIPS2022] demonstrated that algorithms designed for the fixed-confidence setting do not perform well in the fixed-budget setting, which highlights a fundamental difference between the two settings. In particular, for pure exploration, UCB applied to Trees (UCT) in Monte Carlo tree search cannot explore efficiently. To address the scarcity of good algorithms in the FB setting, I proposed a minimax-optimal algorithm that is sample-efficient but computationally prohibitive; my future work will address this by developing a computationally efficient alternative.
My other work on BAI includes a study of Bayesian best-arm identification, which incorporates prior knowledge and is popular in many fields such as economics.
Another strand of my research concerns fairness. In particular, group-based fairness addresses the adverse bias of ML models against minority groups. In this area, I have made several fundamental contributions. Our paper [Komiyama, Takeda, Honda, and Shimao 2018] examined the simplest form of linear regression under group-based fairness constraints. We demonstrated that this problem, although nonconvex, belongs to a special class of solvable problems (quadratically constrained quadratic programs), allowing us to solve fair regression exactly.
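As a schematic version of that formulation (the paper's details differ), fair regression can be written as least squares with a constraint that caps how much of the variation in the prediction is attributable to the sensitive attribute:

\[
\min_{\beta}\ \|y - X\beta\|_2^2
\qquad \text{s.t.} \qquad
\frac{\beta^\top A \beta}{\beta^\top B \beta} \le \epsilon,
\]

where $A$ and $B$ are empirical second-moment matrices ($A$ capturing covariance with the sensitive attribute and $B$ the total variance of the prediction). Clearing the denominator yields a single, possibly indefinite quadratic constraint $\beta^\top (A - \epsilon B)\beta \le 0$, i.e., a nonconvex quadratically constrained quadratic program of the special solvable type mentioned above.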
More recent work explores fairness in sequential learning. For instance, the online fair division of resources among groups (such as refugee camps) can be optimized using UCB with dual averaging, enabling the equitable redistribution of items in real time, even when the value of the items is initially unknown to the distributor.
Finally, fair hiring practices based on contextual information from different groups can be achieved using UCB and a more "greedy" variant of the algorithm.
My work on the reproducibility of results includes research on multiple statistical testing, which assesses the validity of more than one scientific hypothesis. This problem is crucial in domains with numerous hypotheses, such as pattern mining and genomics.
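As a self-contained illustration of multiple testing, the following Python sketch implements the standard Benjamini-Hochberg step-up procedure, which controls the false discovery rate across many hypotheses; this is a textbook procedure shown for context, not a method from my papers, and the p-values are placeholders.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices sorted by p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank                                   # largest rank passing the threshold
    return sorted(order[:k_max])                           # reject the k_max smallest p-values

# toy example: the three smallest p-values survive the correction
print(benjamini_hochberg([0.001, 0.004, 0.019, 0.03, 0.2, 0.5, 0.9]))
```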
I have also collaborated with researchers in finance on factor investing, which selects assets based on attributes like liquidity, value, and size, known as factors. Factor-based strategies can perform as well as active trading by fund managers, though their advantage varies over time. To verify the effectiveness of various factor-based strategies (the so-called "factor zoo"), Fama and French (2010) used bootstrapping methods with limited discussion of their validity. Our recent work justified Fama and French's bootstrapping method, showing that it works when temporal correlation is weak but cross-sectional correlation is high.
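A minimal sketch of this style of bootstrapping, using synthetic data and simplified well beyond Fama and French's actual procedure: resample time periods under the null of zero mean performance and ask whether the best observed strategy could plausibly be luck. All numbers below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic monthly excess returns for 50 hypothetical strategies, all with zero true mean
T, n_strategies = 240, 50
returns = rng.normal(loc=0.0, scale=0.04, size=(T, n_strategies))

observed_best = returns.mean(axis=0).max()        # best in-sample average return

# bootstrap over months under the null (demeaned resampling preserves cross-sectional correlation)
demeaned = returns - returns.mean(axis=0)
n_boot = 2000
best_under_null = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, T, size=T)              # resample months with replacement
    best_under_null[b] = demeaned[idx].mean(axis=0).max()

p_value = (best_under_null >= observed_best).mean()
print(f"best observed mean return: {observed_best:.4f}, bootstrap p-value: {p_value:.3f}")
```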
Recently, I began working on reproducibility in sequential decision-making. Reproducibility is especially important in this setting because decision-makers actively seek information, which introduces selection bias. Recent research demonstrated that minibatch methods are more robust than UCB and Thompson sampling in terms of replicability, a strong form of reproducibility that allows for follow-up experiments by others. We showed that a finely tuned batched algorithm is rate-optimal, meaning it achieves replicability with minimal cost. This supports the widespread practice of A/B/C-testing allocation using small batches.
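As a minimal sketch of the batched style of exploration (an illustrative explore-then-commit variant, not the rate-optimal algorithm from our paper), all arms are sampled in fixed batches and the policy is updated only at batch boundaries, which makes the allocation less sensitive to individual reward realizations; the batch sizes and arm means below are placeholders.

```python
import random

def batched_explore_then_commit(means, batch_size, n_batches, horizon, seed=0):
    """Explore every arm in fixed batches, then commit to the empirically best arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    pulls = 0
    for _ in range(n_batches):                    # exploration phase, updated only per batch
        for arm in range(k):
            for _ in range(batch_size):
                reward = 1.0 if rng.random() < means[arm] else 0.0
                counts[arm] += 1
                sums[arm] += reward
                pulls += 1
    best = max(range(k), key=lambda a: sums[a] / counts[a])
    committed_reward = 0.0
    for _ in range(horizon - pulls):              # commit phase: play the chosen arm
        committed_reward += 1.0 if rng.random() < means[best] else 0.0
    return best, committed_reward

print(batched_explore_then_commit(means=[0.4, 0.5, 0.6], batch_size=50, n_batches=4, horizon=10_000))
```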