AI Safety & Alignment
Course Staff
Instructor:
Prof. Elad Hazan, CS building 409, Office hours: after class, or by appointment.
Teaching Assistants:
Xinyi Chen, CS building 421, Office hours: after class, or by appointment.
Jennifer Sun, CS building 421, Office hours: after class, or by appointment.
Undergrad TA:
Course Description & Basic Information
What existential and social risks do modern AI systems and methods pose? How can we mitigate these risks and ensure that AI systems are safe and aligned with the intentions of their builders? What is currently being done to ensure that AI systems are safe? Are current safeguards sufficient, and if not, how can we improve upon them?
These are the questions we will consider in this advanced seminar. We will consider topics from a variety of disciplines, including algorithms and optimization-based methods for alignment, the mind-body problem as applied to AI, accountability and the free choice problem, the forecasted economic impact of AI breakthroughs, and more.
The topics in this course will not include AI fairness, as this important topic is covered in other courses.
The course will feature talks by leading thinkers on AI alignment and safety, as well as student presentations on readings chosen by the instructors.
This is an advanced graduate-level course, open to all graduate and undergraduate students, although extensive preparation in machine learning / artificial intelligence is expected (COS 324 or equivalent is required).
Reading Materials and Related Courses
(Note: this reading list does not cover fairness and ethics in AI as they relate to fairness across different populations; these very important topics are the main subject of other courses at Princeton.)
Contents
Introduction
Capabilities and Scaling
Rewards and Goals
Understanding and Aligning Ethics
Adversarial Attacks and Red-Teaming
Game-Theoretic Approaches to Alignment
Interpretability
Economic Implications of AGI
Computer Security Implications of AGI
Criticism of AI Risk and Safety
Introduction
The alignment problem from a deep learning perspective, Ngo et al. (2022)
[Blog] Why AI alignment could be hard with modern deep learning, Cotra (2021)
[Blog] Introducing Superalignment, Leike and Sutskever (2023) [OpenAI]
[Statement] Statement on AI risk, Center for AI Safety (2023)
[Documentary review] Unknown: Killer Robots review – the future of AI will fill you with unholy terror, The Guardian (2023).
[Product demo video] Palantir AIP for Defense, Palantir (2023)
[News article, difference between AI ethics and AI safety] There are two factions working to prevent AI dangers. Here’s why they’re deeply divided. Piper (2022)
[News article, context surrounding EA and Rationalist communities in AI Safety] The Reluctant Prophet of Effective Altruism, Lewis-Kraus (2022).
Supplementary
[Forums used to discuss AI Safety] AI Alignment Forum, LessWrong, Effective Altruism Forum
[Resource page] AI Safety Support - Lots of Links
[Student group] Princeton AI Alignment
[Book] Human Compatible: Artificial Intelligence and the Problem of Control, Russell (2019)
[Book] The Alignment Problem: Machine Learning and Human Values, Christian (2020)
Capabilities and Scaling
[Blog] Biological Anchors: A Trick That Might Or Might Not Work, Alexander (2022)
Scaling laws for neural language models, Kaplan et al. (2020) [OpenAI]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Rae et al. (2021) [DeepMind]
Training Compute-Optimal Large Language Models, Hoffmann et al. (2022) [DeepMind]
Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al. (2023) [Microsoft]
Video summary: https://youtu.be/Mqg3aTGNxZ0
Recommended
[Blog] Future ML Systems Will Be Qualitatively Different, Steinhardt (2020)
Evaluating Language-Model Agents on Realistic Autonomous Tasks, Kinniment et al. (2023)
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al. (2023)
Supplementary
Literature review of Transformative Artificial Intelligence timelines, Wynroe et al. (2023)
Are Emergent Abilities of Large Language Models a Mirage? Schaeffer et al. (2023)
Rewards and Goals
Reward Misspecification
[Blog] Specification gaming: the flip side of AI ingenuity, Krakovna et al. (2020)
Deep reinforcement learning from human preferences, Christiano et al. (2017) [OpenAI]
Supplementary
Scaling Laws for Reward Model Overoptimization, Gao et al. (2022) [OpenAI]
Cooperative Inverse Reinforcement Learning, Hadfield-Menell et al. (2016)
Survival Instinct in Offline Reinforcement Learning, Li et al. (2023)
Goal Misgeneralization
Supplementary
Constitutional AI
Constitutional AI: Harmlessness from AI Feedback, Bai et al. (2022) [Anthropic, Askell] (Also a scalable oversight method)
Supplementary
Training language models to follow instructions with human feedback, Ouyang et al. (2022)
A General Language Assistant as a Laboratory for Alignment, Askell et al. (2022) [Anthropic, Askell]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Bai et al. (2022) [Anthropic, Askell]
Measuring Faithfulness in Chain-of-Thought Reasoning, Lanham et al. (2023) [Anthropic]
Understanding and Aligning Ethics
Supplementary
What Would Jiminy Cricket Do? Towards Agents That Behave Morally, Hendrycks et al. (2023) [Steinhardt]
Evaluating the Moral Beliefs Encoded in LLMs, Scherrer et al. (2023)
Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning, Ma et al. (2023)
Adversarial Attacks and Red-Teaming
Universal and Transferable Adversarial Attacks on Aligned Language Models, Zou et al. (2023)
[News Article] Researchers Poke Holes in Safety Controls of ChatGPT and Other Chatbots, Metz (2023)
[Announcement + job posting] Frontier Threats Red Teaming for AI Safety, Anthropic (2023)
Discovering Language Model Behaviors with Model-Written Evaluations, Perez et al. (2022) [Anthropic]
Recommended
Fundamental Limitations of Alignment in Large Language Models, Wolf et al. (2023)
Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks, Casper et al. (2023)
Supplementary
Universal Adversarial Triggers for Attacking and Analyzing NLP, Wallace et al. (2021)
Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Casper et al. (2023)
Red Teaming Language Models with Language Models, Perez et al. (2022) [DeepMind]
FLIRT: Feedback Loop In-context Red Teaming, Mehrabi et al. (2023)
Game-Theoretic Approaches to Alignment
AI safety via debate, Irving et al. (2018) [OpenAI, Christiano]
Measuring Progress on Scalable Oversight for Large Language Models, Perez et al. (2022) [Anthropic]
Supplementary
Interpretability
Discovering Latent Knowledge
Discovering Latent Knowledge in Language Models Without Supervision, Burns et al. (2022) [Steinhardt]
Supplementary
Steering output with interpretability
Steering GPT-2-XL by adding an activation vector, Turner et al. (2023)
Locating and Editing Factual Associations in GPT, Meng et al. (2022)
Supplementary
Interpretability and science of deep learning
Towards Developmental Interpretability, Hoogland et al. (2023)
Studying Large Language Model Generalization with Influence Functions, Grosse et al. (2023)
Mechanistic interpretability
[Informal Note] Interpretability Dreams, Olah (2023) [Anthropic]
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, Wang et al. (2022) [Steinhardt]
Supplementary
Language models can explain neurons in language models, Bills et al. (2023) [OpenAI]
Progress measures for grokking via mechanistic interpretability, Nanda et al. (2023)
Economic Implications of AGI
Economic Growth under Transformative AI, Trammell et al. (2020)
Machines of mind: The case for an AI-powered productivity boom, Baily et al. (2023)
Computer Security Implications of AGI
Criticism of AI Risk and Safety
General
The implausibility of intelligence explosion, Chollet (2017)
Superintelligence: The Idea That Eats Smart People, Cegłowski (idlewords.com)
Don't Fear the Terminator, Zador and LeCun, Scientific American Blog (2019)
[Video] Munk Debate on Artificial Intelligence: Bengio & Tegmark vs. Mitchell & LeCun (YouTube)
From AI ethics
Relevant courses
Resource page
Classical materials
Administrative Information
Lectures: Wed, 13:30-16:20 in Robertson Hall, room 016.
NOTICE: attendance is allowed for registered students only.
Discussion channel: please see the Canvas site for this course, which has an Ed channel.
Requirements: This is an advanced undergraduate/graduate-level seminar that requires independent thinking, reasoning, and presentation ability. Extensive preparation in machine learning / artificial intelligence is expected; COS 324 or equivalent is required.
Attendance and the use of electronic devices: Attendance is expected at all meetings. The use of laptops and similar devices for note-taking is permitted. Food is not allowed, but hot/cold beverages are allowed.
Comic by skeletonclaw, via slides by Devon Wood-Thomas.