Speaker: Jongha Ryu (MIT)
Title: Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing
Paper: https://arxiv.org/abs/2502.10826
Slides: link
More Details:
Authors: Jongha Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun
Abstract: I will present two improved algorithms for off-policy selection and learning in contextual bandits, where the goal is to identify a reward-maximizing policy using offline data collected by a fixed behavior policy.
For off-policy selection, I will introduce a pessimism-based algorithm that uses a new lower confidence bound derived from a betting framework with Cover’s universal portfolio. The resulting guarantee is variance-adaptive and strictly sharper than prior work.
For off-policy learning, I will describe a general condition on optimization objectives that reveals a different bias-variance tradeoff. A special instance, freezing, sharply reduces variance in small-data regimes while matching the best existing guarantees.
I will present experiments showing that the new selection method consistently outperforms previous approaches and that freezing yields substantial gains when data are scarce.
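The betting-based lower confidence bound can be pictured with a simplified sketch. The paper builds on Cover's universal portfolio; the toy version below instead averages the gambler's wealth over a small fixed grid of betting fractions, and it assumes rewards in [0, 1] with importance weights bounded by a known constant w_max. Both simplifications are illustrative assumptions, not the authors' algorithm.

import numpy as np

def betting_lcb(z, alpha=0.05, n_grid=200, fractions=(0.1, 0.25, 0.5, 0.75)):
    """Lower confidence bound on the mean of z_1, ..., z_n in [0, 1] at level 1 - alpha."""
    z = np.asarray(z, dtype=float)
    for m in np.linspace(0.0, 1.0, n_grid):       # candidate means, scanned upward
        # Wealth of a gambler who repeatedly bets a fraction lam of current wealth
        # that the true mean exceeds m; averaging over a few fixed fractions is a
        # crude stand-in for the universal-portfolio mixture used in the paper.
        wealth = np.mean([np.exp(np.sum(np.log1p(lam * (z - m)))) for lam in fractions])
        if wealth < 1.0 / alpha:                  # m is not rejected by the betting test
            return m                              # smallest surviving m is the LCB
    return 1.0

# Hypothetical usage for off-policy selection: with logged rewards r in [0, 1] and
# importance weights w = pi(a|x) / mu(a|x) bounded by w_max, the target policy's
# value is lower-bounded by w_max * betting_lcb(w * r / w_max), and pessimism
# selects the candidate policy with the largest such lower bound.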
Speaker Bio: Jongha (Jon) Ryu is a postdoctoral associate at MIT EECS. He received his Ph.D. in Electrical and Computer Engineering from UC San Diego. His research focuses on the statistical and mathematical foundations of scientific machine learning, with recent work on neural spectral methods, score-based generative modeling, and off-policy and sequential decision-making problems.
Speaker: Nived Rajaraman (UC Berkeley)
Title: Beyond UCB: The curious case of non-linear ridge bandits
Paper: https://arxiv.org/abs/2302.06025
Slides: link
More Details:
Authors: Nived Rajaraman, Yanjun Han, Jiantao Jiao, and Kannan Ramchandran
Abstract: There is a large volume of work on bandits and reinforcement learning when the reward/value function satisfies some form of linearity. But what happens if the reward is non-linear? In this talk, I will present some of our work on the setting where the non-linear reward function takes the form of a ridge function / single index model. Two curious phenomena arise for ridge bandits: first, in addition to the "learning phase" with a standard regret, there is an "initialization phase" with a fixed sample cost determined by the nature of the reward function; second, achieving the smallest sample cost in the initialization phase requires new learning algorithms beyond traditional ones such as UCB. For rewards of the form $f(\langle \theta, x \rangle)$ with a non-linear monotone link function $f$, we derive upper and lower bounds on the optimal fixed cost of learning and, in addition, on the entire "learning trajectory" via differential equations. Our work proposes a two-stage exploration algorithm that first finds a good initialization and subsequently exploits local linearity in the learning phase. In several important special cases (convex $f$ and concave $f$), we prove that the optimal guarantees are also achieved by a very simple algorithm, an interactive variant of stochastic gradient descent. In contrast, several classical and celebrated algorithms, such as UCB and algorithms relying on online/offline regression oracles, are proven to be suboptimal.
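The "interactive variant of stochastic gradient descent" mentioned in the abstract can be pictured with a generic one-point (zeroth-order) gradient-ascent loop on the ridge reward $f(\langle \theta, x \rangle)$. The sketch below is not the paper's two-stage algorithm or its exact gradient estimator; the unit-ball action set, logistic link, and step sizes are illustrative assumptions.

import numpy as np

def ridge_bandit_sgd(pull, d, T, eta=2e-3, delta=0.2, rng=None):
    """pull(x) returns a noisy reward f(<theta*, x>) for an action x in the unit ball."""
    rng = np.random.default_rng(rng)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)                    # random initial direction
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                # uniform direction on the sphere
        x = w + delta * u                     # play the current iterate, slightly perturbed
        x /= max(1.0, np.linalg.norm(x))      # keep the action in the unit ball
        y = pull(x)                           # observe a noisy reward
        g = (d / delta) * y * u               # one-point (zeroth-order) gradient estimate
        w = w + eta * g                       # stochastic gradient ascent step
        w /= max(1.0, np.linalg.norm(w))      # project the iterate back into the ball
    return w

# Illustrative environment: a hidden direction theta* behind a monotone logistic link.
d = 10
theta_star = np.ones(d) / np.sqrt(d)
pull = lambda x: 1.0 / (1.0 + np.exp(-5.0 * (x @ theta_star))) + 0.1 * np.random.randn()
w_hat = ridge_bandit_sgd(pull, d, T=5000)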
Speaker Bio: Nived Rajaraman is currently a postdoctoral researcher at Microsoft Research NYC. He recently finished his PhD at Berkeley, advised by Jiantao Jiao and Kannan Ramchandran. His research interests lie in the intersection of sequential decision making and large language models. He was named a CPAL Rising Star for 2025.
Speaker: Dorian Baudry (INRIA)
Title: Does Stochastic Gradient really succeed for bandits?
Paper: https://openreview.net/pdf?id=gL4muAFwsh
Slides: link
More Details:
Authors: Dorian Baudry, Emmeran Johnson, Simon Vary, Ciara Pike-Burke, Patrick Rebeschini
Abstract: Recent works have deepened the theoretical understanding of the Stochastic Gradient Bandit (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently small learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret regimes of SGB as a function of its learning rate. For two-armed bandits, we identify a sharp threshold, below which SGB achieves logarithmic regret on all instances, and above which it can incur polynomial regret on some instances. This result highlights the necessity of knowing (or estimating) the minimum gap to ensure logarithmic regret with a constant learning rate. For general K-armed bandits, we further show that the learning rate must additionally scale inversely with K to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.
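For reference, the Stochastic Gradient Bandit policy studied here is the softmax policy over K arms updated by REINFORCE-style stochastic gradient ascent on the expected reward with a constant learning rate eta. A minimal sketch follows; the two-armed Bernoulli environment at the end is an illustrative assumption, not an experiment from the paper.

import numpy as np

def sgb(pull, K, T, eta=0.1, rng=None):
    """Run the Stochastic Gradient Bandit policy for T rounds; pull(a) gives a reward in [0, 1]."""
    rng = np.random.default_rng(rng)
    theta = np.zeros(K)                          # softmax logits
    for _ in range(T):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()                           # softmax policy pi_theta
        a = rng.choice(K, p=pi)                  # sample an arm from the current policy
        r = pull(a)                              # observe its reward
        grad = -r * pi                           # since d/dtheta_b log pi(a) = 1{b = a} - pi_b,
        grad[a] += r                             # r * (e_a - pi) is an unbiased gradient estimate
        theta += eta * grad                      # constant-learning-rate ascent step
    return theta

# Example: a two-armed Bernoulli instance with gap 0.1; how eta compares with the
# (unknown) minimum gap determines the regret regime, per the threshold in the talk.
means = np.array([0.5, 0.4])
theta = sgb(lambda a: float(np.random.binomial(1, means[a])), K=2, T=10000, eta=0.05)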
Speaker Bio: I am a researcher at Inria Grenoble & Université Grenoble Alpes, working on the theory of sequential decision-making problems, in particular multi-armed bandits. This work was conducted while I was a postdoctoral researcher in the Department of Statistics at the University of Oxford, working with Patrick Rebeschini. Before that, I was a postdoctoral researcher at ENSAE & IP Paris, working with Vianney Perchet, and I received my PhD from Université de Lille in 2022, supervised by Emilie Kaufmann and Odalric-Ambrym Maillard.