Beyond UCB: The curious case of non-linear ridge bandits
Authors: Nived Rajaraman, Yanjun Han, Jiantao Jiao, and Kannan Ramchandran
Abstract: There is a large body of work on bandits and reinforcement learning when the reward/value function satisfies some form of linearity. But what happens if the reward is non-linear? In this talk, I will present some of our work on the setting where the non-linear reward function takes the form of a ridge function / single index model. Two curious phenomena arise for ridge bandits: first, in addition to the "learning phase" with a standard regret, there is an "initialization phase" with a fixed sample cost determined by the nature of the reward function; second, achieving the smallest sample cost in the initialization phase requires new learning algorithms beyond traditional ones such as UCB. For a special family of non-linear bandits, where the reward is a "ridge" function, i.e., a non-linear monotone link $f$ applied to a linear function of the action, we derive upper and lower bounds on the optimal fixed cost of learning and, in addition, on the entire "learning trajectory" via differential equations. Our work proposes a two-stage exploration algorithm that first finds a good initialization and subsequently exploits local linearity in the learning phase. In several important special cases (convex $f$ and concave $f$), we prove that the optimal guarantees are also achieved by a very simple algorithm: an interactive variant of stochastic gradient descent. In contrast, several classical and celebrated algorithms, such as UCB and algorithms relying on online/offline regression oracles, are proven to be suboptimal.
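To make the setting concrete, below is a minimal toy sketch of a ridge bandit, assuming the standard single index form in which the mean reward is a known monotone link $f$ applied to the inner product between the chosen action and an unknown unit direction. The learner shown is a generic two-point zeroth-order gradient-ascent heuristic, not the two-stage algorithm or the interactive SGD variant from the talk; the link $f(z) = z^3$, the dimension, the noise level, and the step sizes are arbitrary illustrative choices. Because this link is flat near zero, the slow early progress of the loop loosely mirrors why a separate initialization phase, with its own fixed sample cost, matters.

```python
# Toy ridge bandit: mean reward f(<theta_star, a>) with additive noise.
# Illustrative sketch only; not the algorithm from the talk. The link f,
# dimension d, noise level, and step sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)       # unknown unit direction

def f(z):
    return z ** 3                              # example monotone link, flat near 0

def pull(a, noise=0.05):
    """Noisy bandit feedback for action a."""
    return f(theta_star @ a) + noise * rng.standard_normal()

# Generic two-point zeroth-order gradient ascent over the unit sphere
# (a stand-in SGD-style learner, not the paper's method).
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                         # crude random initialization
delta = 0.2                                    # perturbation size for the two-point estimate
for t in range(1, 10_001):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    # Two-point estimate of the directional derivative of the reward along u.
    g_hat = (pull(a + delta * u) - pull(a - delta * u)) / (2 * delta) * u
    a = a + (1.0 / np.sqrt(t)) * g_hat         # decaying step size
    a /= np.linalg.norm(a)                     # project back onto the sphere

print("alignment <a, theta_star>:", a @ theta_star)  # should drift toward 1 as learning succeeds
```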
Speaker Bio: Nived Rajaraman is currently a postdoctoral researcher at Microsoft Research NYC. He recently finished his PhD at Berkeley, advised by Jiantao Jiao and Kannan Ramchandran. His research interests lie at the intersection of sequential decision making and large language models. He was named a CPAL Rising Star for 2025.