Do You Prefer Learning with Preferences?
NeurIPS'23 - Dec 11, New Orleans, USA
Aadirupa Saha & Aditya Gopalan
TUTORIAL OVERVIEW
This tutorial covers the development of, and recent progress on, machine learning with preference-based feedback, where the goal is to sequentially learn the best action of a decision set from preference feedback over an actively chosen subset of items. We will first cover the fundamentals of the classical reward-based multi-armed bandit (MAB) problem and the limitations of this framework when rewards are unknown or harder to obtain. Drawing motivation from these limitations, we will then give a brief overview of the preference-based problem formulation and understand the breakthrough results for the simplest pairwise preference setting (where the subsets are of size 2), famously studied as the `Dueling Bandit' problem in the literature. We will further generalize this to the `Battling Bandits' framework (general subsetwise preference-based bandits) for subsets of arbitrary size and understand the tradeoff between learning rate and subset size.
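To make the interaction protocol concrete, here is a minimal sketch of a dueling-bandit loop under an assumed Bradley-Terry-Luce (BTL) preference model. The utilities, the uniform exploration rule, and the Borda-style winner estimate are illustrative placeholders, not any algorithm covered in the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical BTL utilities for 5 arms (unknown to the learner).
theta = np.array([1.0, 0.8, 0.5, 0.3, 0.1])
K = len(theta)

def duel(i, j):
    """Environment: arm i beats arm j with prob. theta_i / (theta_i + theta_j)."""
    return rng.random() < theta[i] / (theta[i] + theta[j])

# Placeholder learner: duel uniformly random pairs and track win counts.
wins = np.zeros((K, K))
for t in range(10_000):
    i, j = rng.choice(K, size=2, replace=False)
    if duel(i, j):
        wins[i, j] += 1
    else:
        wins[j, i] += 1

# Empirical Borda-style score: average frequency of beating the other arms.
totals = wins + wins.T
freqs = np.divide(wins, totals, out=np.full_like(wins, 0.5), where=totals > 0)
print("estimated best arm:", freqs.mean(axis=1).argmax())
```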
Schedule
Part-I: Tutorial [1:45-3:15pm -- 90 mins]
Part-II: Hands-On Demo [3:15-3:45pm -- 30 mins]
Part-III: Panel Discussion: Human-AI Alignment: Preference & Beyond [3:45-4:15pm -- 30 mins]
(*all times are in the local (Central) timezone)
Tutorial Content
Motivation: Learning from Preferences
Classical Voting Theory (Supervised)
Recommender Systems (Online)
ChatGPT! (RLHF)
Preference Modeling
Pairwise Preferences and Choices
Simple Bradley-Terry-Luce (BTL), Mallows, Nested Logit, Random Utility Models (RUMs)...
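For concreteness, a hedged sketch of two workhorse models named above: the BTL pairwise-preference probability and its Luce (multinomial-logit) choice generalization to subsets. The utility scores are illustrative.

```python
import numpy as np

def btl_pairwise_prob(theta, i, j):
    """Bradley-Terry-Luce: P(i preferred over j) = theta_i / (theta_i + theta_j)."""
    return theta[i] / (theta[i] + theta[j])

def luce_choice_probs(theta, subset):
    """Luce / multinomial-logit choice: P(pick i from S) proportional to theta_i."""
    w = np.array([theta[i] for i in subset], dtype=float)
    return w / w.sum()

theta = [1.0, 0.5, 0.25]                    # illustrative utility scores
print(btl_pairwise_prob(theta, 0, 1))       # 1.0 / 1.5 ~ 0.667
print(luce_choice_probs(theta, [0, 1, 2]))  # ~ [0.571, 0.286, 0.143]
```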
Inference from Offline Preferences
Parametric Estimation
Hypothesis Testing: Best-Arm Identification (BAI), ε-Ranking
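As a flavor of parametric estimation from offline preferences, here is a minimal maximum-likelihood fit of BTL scores from a matrix of pairwise win counts, via plain gradient ascent on the (concave) log-likelihood. A sketch under the BTL assumption, not an optimized estimator; the data below are made up.

```python
import numpy as np

def fit_btl(wins, iters=2000, lr=1.0):
    """Maximum-likelihood BTL scores from a pairwise win-count matrix,
    where wins[i, j] = number of times item i was preferred over item j.
    Plain gradient ascent in log-score space; a minimal illustration."""
    K = wins.shape[0]
    u = np.zeros(K)                          # u = log(theta)
    n = wins + wins.T                        # comparisons per pair
    for _ in range(iters):
        theta = np.exp(u)
        # Expected wins of i: sum_j n_ij * theta_i / (theta_i + theta_j)
        expected = (n * theta[:, None] / (theta[:, None] + theta[None, :])).sum(axis=1)
        grad = wins.sum(axis=1) - expected   # d(log-likelihood)/du
        u += lr * grad / max(n.sum(), 1)
        u -= u.mean()                        # pin the scale; BTL is shift-invariant in u
    return np.exp(u)

# Illustrative data: item 0 usually beats 1, which usually beats 2.
wins = np.array([[0., 8., 9.],
                 [2., 0., 7.],
                 [1., 3., 0.]])
print(fit_btl(wins))  # estimated scores should order the items 0 > 1 > 2
```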
Active Learning through Preferences
Dueling Bandits
Simple Regret Minimization with Dueling Bandits (a toy version is sketched after this list)
Pointers to SOTA
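A toy version of the regret-minimization task above: a Thompson-sampling-flavored heuristic that maintains Beta posteriors on each pairwise win probability, samples a plausible preference matrix, and duels the two arms with the best sampled Copeland scores. This is an illustrative heuristic under a BTL environment, not a faithful implementation of any of the SOTA algorithms pointed to above.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, 0.7, 0.4, 0.2])   # hidden BTL utilities (illustrative)
K = len(theta)
wins = np.ones((K, K))                    # Beta(1, 1) prior on each pairwise prob.
losses = np.ones((K, K))

for t in range(5_000):
    # Sample a plausible preference matrix, then duel the two arms that
    # look best under it (measured here by sampled Copeland scores).
    p = rng.beta(wins, losses)
    np.fill_diagonal(p, 0.5)
    copeland = (p > 0.5).sum(axis=1)
    i, j = np.argsort(copeland)[-2:]
    if rng.random() < theta[i] / (theta[i] + theta[j]):
        wins[i, j] += 1; losses[j, i] += 1
    else:
        wins[j, i] += 1; losses[i, j] += 1

p_hat = wins / (wins + losses)
print("empirical Copeland winner:", int((p_hat > 0.5).sum(axis=1).argmax()))
```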
Generalization to Subsets
Battle of Bandits
PAC Rank Recovery with Battling Bandits
Pointers to SOTA
Contextual Battling Bandits
User-Specific Customization with Preference Feedback
Applications to the Operations Research (OR) Literature
Online & Non-Stationary Preferences
Dynamic Regret in Dueling Bandits
Measures of Non-Stationarity
AI Alignment with Preference Feedback
Preference-based RL (PbRL)
Limitations of Classical RL & the Role of PbRL (from Safety to Reward Engineering)
ChatGPT through the lens of RLHF (the core reward-model loss is sketched after this list)
Pointers to SOTA
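To ground the RLHF discussion, a hedged sketch of the standard pairwise reward-model objective used in RLHF pipelines of the kind described by Christiano et al. (2017) and Ouyang et al. (2022): minimize -log sigmoid(r(chosen) - r(rejected)) over human preference pairs. The scores below are plain numbers standing in for reward-model outputs.

```python
import numpy as np

def reward_model_pairwise_loss(r_chosen, r_rejected):
    """RLHF-style reward-model objective on preference pairs:
    mean of -log(sigmoid(r(chosen) - r(rejected)))."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return np.mean(np.log1p(np.exp(-margin)))  # -log(sigmoid(m)) = log(1 + e^-m)

# Illustrative scores on three preference pairs; a smaller loss means the
# model more confidently prefers the human-chosen response.
print(reward_model_pairwise_loss([2.0, 1.5, 0.3], [1.0, 1.8, 0.1]))
```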
Hands-on Demo & Future Directions
Simple Rank Aggregation (a Borda-count version is sketched below)
Training Human-like Robots
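The demo materials themselves are not reproduced here; as a flavor of what simple rank aggregation can look like, here is a hedged Borda-count sketch over a few made-up ranked lists.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate ranked lists by Borda count: each voter gives an item
    (n - position) points, n being the list length; highest total wins."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    return sorted(scores, key=scores.get, reverse=True)

# Three illustrative voters ranking the same three items.
votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(borda_aggregate(votes))  # ['a', 'b', 'c']
```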
Panel Discussion: Human-AI Alignment: Preference & Beyond
Target Audience (Prerequisites)
The tutorial is meant to be accessible to the entire machine learning community, and especially useful for bandit and reinforcement learning researchers.
Prerequisites: A basic knowledge of probability theory and linear algebra should be enough. Familiarity with standard concentration inequalities and state-of-the-art multi-armed bandit (MAB) algorithms would be helpful (only for following the algorithmic technicalities) but is not necessary; as mentioned, we will cover the basics of classical MAB techniques at the beginning of the talk. The tutorial will be self-contained, with all the basic definitions provided.
Most of the target audience is likely to be machine-learning oriented, cutting across grad students, postdocs, and faculty. Overall, any first-year grad student should be able to follow comfortably. The tutorial intends to give the audience enough exposure to build a basic understanding of bandit problems, the need for their preference-based counterparts, existing results, and exciting open challenges.
Some References
Classical Methods for Learning with Preferences
Book - Preference Learning by Johannes Fürnkranz and Eyke Hüllermeier
Active Learning with Preferences
Survey - Bengs et al. (2021)
Survey (older) - Sui et al. (2018)
Reinforcement Learning with Human Feedback
Survey (slightly older) - Wirth et al. (2017); see also Christiano et al. (2017)
Blog - Hugging Face, OpenAI
Language model paper - Ouyang et al. (2022)
Open Source Code Library
More general references for Online Learning
Books - (1) Bandit Algorithms by Tor Lattimore and Csaba Szepesvári (2) Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi (3) Introduction to Online Convex Optimization by Elad Hazan [online version].
Course Lectures - Online Prediction and Learning course (by Aditya Gopalan)
Course on Concentration Inequalities
You are also welcome to check some of our recent publications on online/bandit learning from preference feedback in complex environments.