Beyond Rewards: Online Learning from Preference Feedback
Tutorial at ECML-PKDD 2022 - Sept 23 @ 9 am - 12:30 pm CEST (Online)


This tutorial intends to cover the development and recent progress on Preference based online learning where the goal is to sequentially learn the best-action of a decision set from preference feedback over an actively chosen subset of items. We will first cover the basic fundamentals of classical (reward) based multiarmed bandits problem and the limitations of this framework when reward are unknown/ harder to obtain. Drawing motivations from these limitations, we will then start with a brief overview of the motivation and problem formulation and understand the breakthrough results for the simplest pairwise preference setting (where the subsets are of size 2), famously studied as the `Dueling Bandit' problem in the literature. We will further generalize this to the `Battling Bandits' (general subsetwise preference based bandits) framework for subsets of any arbitrary size and understand the tradeoff between learning rates-vs-increasing subset sizes. See details of the topics here.

Tutorial Schedule

  • Part-I: Basics of (reward based) Multiarmed Bandits (MAB), it's limitations and motivation for preference bandits [45 mins]

  • Part-II: Dueling Bandits and real world applications [30 mins]

-------------------------------- BREAK [ 30 mins ] -----------------------------------

  • Part-III: Battling Bandits (Preference Bandits): Learning with subsetwise feedback and implications [45 mins]

  • Part-IV: Recent advancements and future scopes: Reinforcement learning with preference based feedback: Challenges in theory and practice [30 mins]

Target Audience (Prerequisites)

The tutorial is meant to be accessible to the entire machine learning community, and especially useful for bandits and reinforcement learning researchers.

Prerequisites: A basic knowledge of probability theory, and linear algebra should be enough. Familiarity with standard concentration inequalities, state of the art multiarmed bandits (MAB) algorithms would be helpful (only to understand the algorithm technicalities), but not necessary and as mentioned, we will cover the basics of classical MAB techniques in the beginning of the talk. The tutorial will be self-contained with all the basic definitions.

Most of the target audiences are likely to be Machine Learning oriented, cutting across grad students, postdocs, or faculties. Overall, any first year grad student is expected to be comfortable. The tutorial intends to provide enough exposure to the audience to built a basic understanding of bandit-problems, the need of its preference counterpart, existing results, and exciting scopes of open challenges.

Tutorial Materials

Online Learning (book): Prediction, Learning and Games by Nicolo Cesa-Bianchi and Gabor Lugosi, Cambridge University Press, 2006.

Multiarmed Bandit (book): Bandit Algorithms by Tor Lattimore and Csaba Szepesv´ari

Online Optimization: Tutorial by Sebastin Bubeck, and Elad Hazan

Online Prediction and Learning course (by Aditya Gopalan): You can also check some of the course lecture notes suggested here.

Survey of Dueling Bandits: Bengs et al. (2021)

You are also welcome to check some of our recent publications on online/bandit learning from preference feedback in complex environments.