Large Language Model (LLM) alignment has become an increasingly critical topic in contemporary AI research, especially as LLMs continue to scale and are integrated into real-world applications. Ensuring that LLMs generate outputs aligned with human values, preferences, and ethical considerations is essential for their safe and effective deployment. This tutorial aims to provide a comprehensive introduction to LLM alignment methods, offering a structured and accessible entry point for researchers and practitioners interested in the field. It will present key concepts and challenges, introduce fundamental approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), and, building on these foundations, review a spectrum of refinements and variants. In addition, it will cover recent advances in game-theoretic approaches to alignment and in theoretical frameworks that provide a deeper understanding of alignment methodologies. Beyond theoretical insights, the tutorial will emphasize the practical aspects of LLM alignment, illustrating how these techniques are applied in real-world scenarios and guiding participants in building intuition about alignment strategies. By the end of the tutorial, attendees will have gained a solid foundation in LLM alignment, equipping them to critically engage with the field, understand current research trends, and explore future directions.
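As a rough orientation before the outline, the two fundamental approaches named above are commonly stated as the following objectives. This is a sketch using standard notation from the literature rather than the tutorial's own materials: $\pi_\theta$ denotes the policy being trained, $\pi_{\mathrm{ref}}$ a fixed reference policy, $r_\phi$ a learned reward model, $\beta$ a KL coefficient, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with a preferred and a dispreferred response drawn from a preference dataset $\mathcal{D}$.

RLHF (KL-regularized policy optimization against a learned reward model):
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]$$

DPO (fitting the policy directly to pairwise preferences, with no explicit reward model):
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

The middle sections of the outline build on exactly this contrast: RLHF keeps an explicit reward model in the training loop, while direct alignment algorithms such as DPO optimize the policy on preference data directly.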
Alignment for LLMs: Introduction
1. Why Alignment Matters
2. Learning from Human Feedback
Alignment with Reward Models
1. The Path to RLHF
2. Deep Dive into RLHF
3. Challenges of RLHF
Alignment without Reward Models
1. Direct Alignment Algorithms
2. Limitations of Direct Alignment Algorithms
3. Online Direct Alignment Algorithms
4. How to Choose: RLHF or DPO
Alignment with General Preference Models
1. Revisiting Stages in Language Model Training
2. Solution Concept
3. Solving for the Minimax Winner
Alignment with Verifiers
1. Era of Experience
2. Test-time Scaling Law
3. Verifiable Rewards
4. Process Rewards
Conclusion
Dr. Yaodong Yang is a Boya Assistant Professor at the Institute for Artificial Intelligence of Peking University. His research focuses on safe interaction and alignment.
Mingzhi Wang is a research assistant at the Institute for Artificial Intelligence of Peking University. His research focuses on game theory and value alignment.