Motivation: Studies have shown that it is often easier and less biased to elicit user feedback through preferences rather than absolute scores/ratings. For example, to understand their liking for a given pair of items, say (A, B), it is easier for users to answer a preference-based query, "Do you prefer item A over item B?", than its absolute counterpart, "How would you score items A and B out of 10?". Be it a perfume dealer wanting to maximize revenue by launching a set of the 5 most popular fragrances, a movie-recommender system trying to rank the most favorite movies to recommend to its users, or a pharmaceutical company testing the relative efficacies of a set of drugs: learning from preference feedback has widespread applicability in many real-world problems. It could greatly impact the design of better systems for human-machine interaction, especially where human preferences are elicited in an online fashion, say in the design of surveys and expert reviews, assortment selection, search engine optimization, recommender systems, ranking in multiplayer games, etc., or even in more general reinforcement learning problems where reward shaping is often challenging, e.g., in multi-objective reward-based optimization. In such scenarios, preference feedback is much easier to elicit, and it is a more consistent and cost-effective way to learn from users.
Limitations of the Existing (Reward-Based) Multi-Armed Bandit (MAB) Framework: Despite the wide prevalence of preference-based learning settings, the classical reward/loss-based online learning approaches, such as Multi-Armed Bandits (Auer, Cesa-Bianchi, and Fischer [2002]), are inadequate for expressing relative choices between items, as they typically model absolute utility or reward feedback on the selected set of items. The Dueling Bandit (DB) problem (Yue and Joachims [2009]) first attempted to model this problem with pairwise preferences (i.e., subsets of size 2). However, why should we restrict the feedback to pairwise preferences, when a subsetwise preference model is usually much more relevant in various practical scenarios (e.g., recommender systems, search engine optimization, crowd-sourcing, e-learning platforms, etc.), more budget-friendly, and more flexible in expressing several types of feedback, e.g., the best item, a full ranking, the top-5, or any partial rank ordering of the subset? This paved the way for preference bandits with general subsetwise feedback (Ren, Liu, and Shroff [2018]), also studied as Battling Bandits (BB) (Saha and Gopalan [2018, 2019, 2020]).
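To make the difference between these feedback models concrete, here is a minimal Python sketch (not from the tutorial materials; the utility vector theta, the Bradley-Terry/Plackett-Luce parametrization, and all function names are illustrative assumptions) of a classical reward oracle, a pairwise-preference oracle as in Dueling Bandits, and a winner-of-a-subset oracle as in Battling Bandits:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # hypothetical item utilities, for illustration only

def reward_feedback(arm):
    """Classical MAB: noisy absolute reward of the chosen arm."""
    return rng.binomial(1, theta[arm])

def duel_feedback(a, b):
    """Dueling Bandits: Bradley-Terry pairwise preference; returns the winning arm."""
    p_a_beats_b = theta[a] / (theta[a] + theta[b])
    return a if rng.random() < p_a_beats_b else b

def battle_feedback(subset):
    """Battling Bandits: Plackett-Luce 'winner of the subset' feedback."""
    scores = theta[np.array(subset)]
    return subset[rng.choice(len(subset), p=scores / scores.sum())]

print(reward_feedback(0))          # absolute reward of arm 0
print(duel_feedback(0, 3))         # which of arms 0 and 3 is preferred
print(battle_feedback([0, 2, 4]))  # which arm wins within the offered subset
```

The point of the sketch is only that the learner in the latter two settings never observes absolute rewards, only relative outcomes among the items it chooses to offer.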
Main topics to cover: The first part of the tutorial will cover the basics of multi-armed bandits (MAB) (Auer [2000]; Audibert and Bubeck [2010]) and the limitations of this classic framework in the absence of rewards/losses. We will then focus on the advantages of online decision making from preference feedback in various real-world scenarios, famously studied as 'Dueling Bandits' (for learning from pairwise preferences) (Zoghi, Whiteson, Munos, and de Rijke [2014]; Gajane, Urvoy, and Clerot [2015]; Saha, Koren, and Mansour [2021]) or, more generally, 'Battling Bandits' (Saha and Gopalan [2018]; Bengs, Saha, and Hüllermeier [2021]; Saha [2021]). We will then cover the breakthrough results for stochastic preference bandits under different problem objectives, including regret minimization, PAC best-arm identification, and learning-to-rank, as well as some of the recent developments in more complex settings which are arguably harder to address (Gupta and Saha [2021]; Saha and Krishnamurthy [2022]; Saha, Koren, and Mansour [2021, 2022]). We will next examine the scope of preference-based learning within other online sequential learning frameworks and some of its recent extensions to more general setups, including sleeping bandits, large/infinite decision spaces, contextual scenarios, and even reinforcement learning frameworks. In the last part, we will discuss some of the key open questions in preference bandits, the huge potential of applying preference-based learning in reinforcement learning scenarios (famously studied as the 'PbRL' problem; Wirth et al. [2017]; Pacchiano, Saha, and Lee [2021]; Novoseller et al. [2020]; Xu et al. [2020]), and the gaps between theory and practice.
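As a small taste of one of these objectives, PAC best-arm identification from preference feedback, here is a simplified successive-elimination sketch. It is not any specific published algorithm and assumes a total order over the arms; it reuses the hypothetical duel_feedback oracle sketched above, and the function name, constants, and confidence radius are illustrative choices only.

```python
import numpy as np

def find_best_arm(duel, n_arms, delta=0.05, max_rounds=5000):
    """Simplified successive elimination from pairwise preferences:
    duel(i, j) returns the index of the winner of one comparison."""
    active = list(range(n_arms))
    wins = np.zeros((n_arms, n_arms))    # wins[i, j]: times i beat j
    counts = np.zeros((n_arms, n_arms))  # counts[i, j]: times i and j dueled
    for t in range(1, max_rounds + 1):
        # Round-robin: duel every active pair once per round.
        for i in active:
            for j in active:
                if i < j:
                    winner = duel(i, j)
                    wins[winner, i + j - winner] += 1
                    counts[i, j] += 1
                    counts[j, i] += 1
        n = np.maximum(counts, 1)
        p_hat = wins / n                                              # empirical win rates
        conf = np.sqrt(np.log(4 * n_arms**2 * t**2 / delta) / (2 * n))  # Hoeffding-style radius
        # Drop arm i once some other active arm beats it with high confidence.
        survivors = [i for i in active
                     if not any(p_hat[j, i] - conf[j, i] > 0.5
                                for j in active if j != i)]
        active = survivors or active  # guard against an (unlikely) simultaneous wipe-out
        if len(active) == 1:
            return active[0]
    return active[0]  # comparison budget exhausted; return a surviving arm
```

Run against the Bradley-Terry oracle above, find_best_arm(duel_feedback, 5) would typically return arm 0; the intended takeaway is simply that confidence bounds on pairwise win rates play the role that confidence bounds on rewards play in classical MAB algorithms.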
Main objectives: Broadly, the goal of this tutorial is to encourage the ML community to rethink the standard decision-making interaction models that are usually assumed. Undoubtedly, classical online learning algorithms perform well for simple problems, but they are often unsuitable for many learning scenarios in practice, especially where the reward or loss objectives are not properly defined. The tutorial is intended to benefit any ML researcher or graduate student, or anyone seeking basic exposure to techniques of online learning and preference elicitation. Finally, we would consider it a success if the sessions inspire researchers to embark on new projects and explore novel areas of research in online learning with complex reward models. It would also be very exciting to draw more attention from experimental researchers as well as theorists, with the objective of fostering the development of better models and practical algorithms for the different problems we will cover during the tutorial. We also look forward to discussing some brainstorming questions and interesting open problems to explore further.
Hope to see you at the tutorial. Please contact us if you have any other questions!