Do You Prefer Learning with Preferences?
NeurIPS'23 - Dec 11, New Orleans, USA
Aadirupa Saha & Aditya Gopalan
TUTORIAL OVERVIEW
This tutorial covers the development of, and recent progress on, machine learning with preference-based feedback, where the goal is to sequentially learn the best action of a decision set from preference feedback over an actively chosen subset of items. We will first cover the fundamentals of the classical reward-based multi-armed bandit (MAB) problem and the limitations of this framework when rewards are unknown or harder to obtain. Drawing motivation from these limitations, we will then give a brief overview of the preference-based problem formulation and understand the breakthrough results for the simplest pairwise preference setting (where the subsets are of size 2), famously studied as the `Dueling Bandit' problem in the literature. We will further generalize this to the `Battling Bandits' framework (general subsetwise preference-based bandits) for subsets of arbitrary size and understand the tradeoff between learning rate and subset size.
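To make the interaction protocol concrete, here is a minimal sketch of a dueling-bandit loop under an assumed Bradley-Terry-Luce (BTL) preference model. The utilities, the uniform exploration rule, and the Borda-style winner estimate are illustrative placeholders, not any algorithm covered in the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical BTL utilities for 5 arms (unknown to the learner).
theta = np.array([1.0, 0.8, 0.5, 0.3, 0.1])
K = len(theta)

def duel(i, j):
    """Environment: arm i beats arm j with prob. theta_i / (theta_i + theta_j)."""
    return rng.random() < theta[i] / (theta[i] + theta[j])

# Placeholder learner: duel uniformly random pairs and track win counts.
wins = np.zeros((K, K))
for t in range(10_000):
    i, j = rng.choice(K, size=2, replace=False)
    if duel(i, j):
        wins[i, j] += 1
    else:
        wins[j, i] += 1

# Empirical Borda-style score: average frequency of beating the other arms.
totals = wins + wins.T
freqs = np.divide(wins, totals, out=np.full_like(wins, 0.5), where=totals > 0)
print("estimated best arm:", freqs.mean(axis=1).argmax())
```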
Schedule
Part-I: Tutorial [1:45-3:15pm -- 90 mins]
Part-II: Hands-On Demo [3:15-3:45pm -- 30 mins]
Part-III: Panel Discussion: Human-AI Alignment: Preference & Beyond [3:45-4:15pm -- 30 mins]
(*all times are in the local (Central) timezone)
Tutorial Content
Motivation: Learning from Preferences
Classical Voting Theory (Supervised)
Recommender Systems (Online)
ChatGPT! (RLHF)
Preference Modeling
Pairwise Preferences and Choices
Simple Bradley-Terry-Luce (BTL), Mallows, Nested Logit, Random Utility Models (RUMs)...
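For concreteness, a hedged sketch of two workhorse models named above: the BTL pairwise-preference probability and its Luce (multinomial-logit) choice generalization to subsets. The utility scores are illustrative.

```python
import numpy as np

def btl_pairwise_prob(theta, i, j):
    """Bradley-Terry-Luce: P(i preferred over j) = theta_i / (theta_i + theta_j)."""
    return theta[i] / (theta[i] + theta[j])

def luce_choice_probs(theta, subset):
    """Luce / multinomial-logit choice: P(pick i from S) proportional to theta_i."""
    w = np.array([theta[i] for i in subset], dtype=float)
    return w / w.sum()

theta = [1.0, 0.5, 0.25]                    # illustrative utility scores
print(btl_pairwise_prob(theta, 0, 1))       # 1.0 / 1.5 ~ 0.667
print(luce_choice_probs(theta, [0, 1, 2]))  # ~ [0.571, 0.286, 0.143]
```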
Inference from Offline Preferences
Parametric Estimation
Hypothesis Testing: Best-Arm Identification (BAI), ε-Ranking
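As a flavor of parametric estimation from offline preferences, here is a minimal maximum-likelihood fit of BTL scores from a matrix of pairwise win counts, via plain gradient ascent on the (concave) log-likelihood. A sketch under the BTL assumption, not an optimized estimator; the data below are made up.

```python
import numpy as np

def fit_btl(wins, iters=2000, lr=1.0):
    """Maximum-likelihood BTL scores from a pairwise win-count matrix,
    where wins[i, j] = number of times item i was preferred over item j.
    Plain gradient ascent in log-score space; a minimal illustration."""
    K = wins.shape[0]
    u = np.zeros(K)                          # u = log(theta)
    n = wins + wins.T                        # comparisons per pair
    for _ in range(iters):
        theta = np.exp(u)
        # Expected wins of i: sum_j n_ij * theta_i / (theta_i + theta_j)
        expected = (n * theta[:, None] / (theta[:, None] + theta[None, :])).sum(axis=1)
        grad = wins.sum(axis=1) - expected   # d(log-likelihood)/du
        u += lr * grad / max(n.sum(), 1)
        u -= u.mean()                        # pin the scale; BTL is shift-invariant in u
    return np.exp(u)

# Illustrative data: item 0 usually beats 1, which usually beats 2.
wins = np.array([[0., 8., 9.],
                 [2., 0., 7.],
                 [1., 3., 0.]])
print(fit_btl(wins))  # estimated scores should order the items 0 > 1 > 2
```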
Active Learning through Preferences
Dueling Bandits
Simple Regret Minimization with Dueling Bandits (a toy version is sketched after this list)
Pointers to SOTA
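A toy version of the regret-minimization task above: a Thompson-sampling-flavored heuristic that maintains Beta posteriors on each pairwise win probability, samples a plausible preference matrix, and duels the two arms with the best sampled Copeland scores. This is an illustrative heuristic under a BTL environment, not a faithful implementation of any of the SOTA algorithms pointed to above.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, 0.7, 0.4, 0.2])   # hidden BTL utilities (illustrative)
K = len(theta)
wins = np.ones((K, K))                    # Beta(1, 1) prior on each pairwise prob.
losses = np.ones((K, K))

for t in range(5_000):
    # Sample a plausible preference matrix, then duel the two arms that
    # look best under it (measured here by sampled Copeland scores).
    p = rng.beta(wins, losses)
    np.fill_diagonal(p, 0.5)
    copeland = (p > 0.5).sum(axis=1)
    i, j = np.argsort(copeland)[-2:]
    if rng.random() < theta[i] / (theta[i] + theta[j]):
        wins[i, j] += 1; losses[j, i] += 1
    else:
        wins[j, i] += 1; losses[i, j] += 1

p_hat = wins / (wins + losses)
print("empirical Copeland winner:", int((p_hat > 0.5).sum(axis=1).argmax()))
```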
Generalization to Subsets
Battle of Bandits
PAC Rank Recovery with Battling Bandits
Pointers to SOTA
Contextual Battling Bandits
User-Specific Customization with Preference Feedback
Applications to the Operations Research (OR) Literature
Online & Non-Stationary Preferences
Dynamic Regret in Dueling Bandits
Measures of Non-Stationarity
AI Alignment with Preference Feedback
Preference-based RL (PbRL)
Limitations of Classical RL & the Role of PbRL (from Safety to Reward Engineering)
ChatGPT through the lens of RLHF (the core reward-model loss is sketched after this list)
Pointers to SOTA
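To ground the RLHF discussion, a hedged sketch of the standard pairwise reward-model objective used in RLHF pipelines of the kind described by Christiano et al. (2017) and Ouyang et al. (2022): minimize -log sigmoid(r(chosen) - r(rejected)) over human preference pairs. The scores below are plain numbers standing in for reward-model outputs.

```python
import numpy as np

def reward_model_pairwise_loss(r_chosen, r_rejected):
    """RLHF-style reward-model objective on preference pairs:
    mean of -log(sigmoid(r(chosen) - r(rejected)))."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return np.mean(np.log1p(np.exp(-margin)))  # -log(sigmoid(m)) = log(1 + e^-m)

# Illustrative scores on three preference pairs; a smaller loss means the
# model more confidently prefers the human-chosen response.
print(reward_model_pairwise_loss([2.0, 1.5, 0.3], [1.0, 1.8, 0.1]))
```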
Hands-on Demo & Future Directions
Simple Rank Aggregation (a Borda-count version is sketched below)
Training Human-like Robots
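The demo materials themselves are not reproduced here; as a flavor of what simple rank aggregation can look like, here is a hedged Borda-count sketch over a few made-up ranked lists.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate ranked lists by Borda count: each voter gives an item
    (n - position) points, n being the list length; highest total wins."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    return sorted(scores, key=scores.get, reverse=True)

# Three illustrative voters ranking the same three items.
votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(borda_aggregate(votes))  # ['a', 'b', 'c']
```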
Panel Discussion: Human-AI Alignment: Preference & Beyond
Target Audience (Prerequisites)
The tutorial is meant to be accessible to the entire machine learning community, and especially useful for bandit and reinforcement learning researchers.
Prerequisites: A basic knowledge of probability theory and linear algebra should be enough. Familiarity with standard concentration inequalities and state-of-the-art multi-armed bandit (MAB) algorithms would be helpful (only for following the algorithmic technicalities) but is not necessary; as mentioned, we will cover the basics of classical MAB techniques at the beginning of the talk. The tutorial will be self-contained, with all the basic definitions provided.
Most of the target audience is likely to be machine-learning oriented, cutting across grad students, postdocs, and faculty. Overall, any first-year grad student should be able to follow comfortably. The tutorial intends to give the audience enough exposure to build a basic understanding of bandit problems, the need for their preference-based counterparts, existing results, and exciting open challenges.
Some References
Classical Methods for Learning with Preferences
Book - Preference Learning by Johannes Fürnkranz and Eyke Hüllermeier
Active Learning with Preferences
Survey - Bengs et al. (2021)
Survey (older) - Sui et al. (2018)
Reinforcement Learning with Human Feedback
Survey (slightly older) - Wirth et al. (2017); see also Christiano et al. (2017)
Blog - Hugging Face, OpenAI
Language model paper - Ouyang et al. (2022)
Open Source Code Library
More general references for Online Learning
Books - (1) Bandit Algorithms by Tor Lattimore and Csaba Szepesvári (2) Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi (3) Introduction to Online Convex Optimization by Elad Hazan [online version].
Course Lectures - Online Prediction and Learning course (by Aditya Gopalan)
Course on Concentration Inequalities
You are also welcome to check some of our recent publications on online/bandit learning from preference feedback in complex environments.