Large language models (LLMs) have demonstrated remarkable capabilities, yet ensuring that their behavior aligns with human intent remains one of the most pressing challenges in machine learning. This tutorial provides a comprehensive, machine learning–oriented overview of alignment methods for LLMs. We begin by motivating alignment through classic concerns such as reward hacking, societal bias, and power-seeking behaviors, framing alignment as a recurring cycle of forward training and backward refinement. We then trace the origins of alignment with reward models to inverse reinforcement learning, which laid the foundation for preference-based learning and the modern RLHF pipeline, as well as its recent extensions such as reinforcement learning from AI feedback (RLAIF) and Constitutional AI. Building on the same perspective, we show how direct preference optimization (DPO) and its generalizations can be derived as special cases of Inverse Preference Learning (IPL), offering a simpler alternative to RLHF. We further explore general preference modeling that captures non-transitive preferences, and verifier-based alignment that integrates external evaluators into training. Throughout, we emphasize a fundamental paradigm shift in AI alignment: from passive learning constrained by human-labeled data to autonomous self-improvement through experience. This tutorial aims to equip the community with a unified perspective on alignment methods and to chart key directions toward building safe, reliable, and value-aligned LLMs.
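As a concrete anchor for the reward-model-free route mentioned above, the sketch below recalls the standard DPO objective; the notation is ours and is included only for orientation, not drawn from the tutorial materials. Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference policy, $(x, y_w, y_l)$ a prompt with preferred and dispreferred responses, $\beta$ an inverse-temperature parameter, and $\sigma$ the logistic function:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right].
\]

Eliminating the explicit reward model in this way is what makes the approach a simpler alternative to the RLHF pipeline; the IPL perspective discussed in the tutorial recovers this objective as a special case.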
Dr. Yaodong Yang is a Boya Assistant Professor at the Institute for Artificial Intelligence, Peking University. His research focuses on safe interaction and alignment.
Mingzhi Wang is a research assistant at the Institute for Artificial Intelligence, Peking University, and a Research Fellow at the Beijing Academy of Artificial Intelligence (BAAI). His research focuses on game theory and value alignment.