AI Safety & Alignment
Course Staff
Instructor:
Prof. Elad Hazan, CS building 409, Office hours: after class, or by appointment.
Teaching Assistants:
Xinyi Chen, CS building 421, Office hours: after class, or by appointment.
Jennifer Sun, CS building 421, Office hours: after class, or by appointment.
Undergrad TA:
Course Description & Basic Information
What existential and social risks do modern AI systems and methods pose? How can we mitigate these risks and ensure that AI systems are safe and aligned with the intentions of their builders? What is currently being done to ensure that AI systems are safe? Are current safeguards sufficient, and if not, how can we improve upon them?
These are the questions we will consider in this advanced seminar. We will consider topics from a variety of disciplines, including algorithms and optimization-based methods for alignment, the mind-body problem as applied to AI, accountability and the free choice problem, the forecasted economic impact of AI breakthroughs, and more.
The topics in this course will not include AI fairness, as this important topic is covered in other courses.
The course will feature talks by leading thinkers on AI alignment and safety, as well as student presentations on readings chosen by the instructors.
This is an advanced graduate-level course, open to all graduate and undergraduate students, although extensive preparation in machine learning / artificial intelligence is expected (COS 324 or equivalent is required).
Reading Materials and Related Courses
(Note: this reading list does not cover fairness and ethics in AI as they relate to fairness across different populations; these very important topics are the main subject of other courses at Princeton.)
Contents
Introduction
Capabilities and Scaling
Rewards and Goals
Understanding and Aligning Ethics
Adversarial Attacks and Red-Teaming
Game-Theoretic Approaches to Alignment
Interpretability
Economic Implications of AGI
Computer Security Implications of AGI
Criticism of AI Risk and Safety
Introduction
The alignment problem from a deep learning perspective, Ngo et al. (2022)
[Blog] Why AI alignment could be hard with modern deep learning, Cotra (2021)
[Blog] Introducing Superalignment, Leike and Sutskever (2023) [OpenAI]
[Statement] Statement on AI risk, Center for AI Safety (2023)
[Documentary review] Unknown: Killer Robots review – the future of AI will fill you with unholy terror, The Guardian (2023).
[Product demo video] Palantir AIP for Defense, Palantir (2023)
[News article, difference between AI ethics and AI safety] There are two factions working to prevent AI dangers. Here’s why they’re deeply divided. Piper (2022)
[News article, context surrounding EA and Rationalist communities in AI Safety] The Reluctant Prophet of Effective Altruism, Lewis-Kraus (2022).
Supplementary
[Forums used to discuss AI Safety] AI Alignment Forum, LessWrong, Effective Altruism Forum
[Resource page] AI Safety Support - Lots of Links
[Student group] Princeton AI Alignment
[Book] Human Compatible: Artificial Intelligence and the Problem of Control, Russell (2019)
[Book] The Alignment Problem: Machine Learning and Human Values, Christian (2020)
Capabilities and Scaling
[Blog] Biological Anchors: A Trick That Might Or Might Not Work, Alexander (2022)
Scaling laws for neural language models, Kaplan et al. (2020) [OpenAI]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Rae et al. (2021) [DeepMind]
Training Compute-Optimal Large Language Models, Hoffmann et al. (2022) [DeepMind]
Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al. (2023) [Microsoft]
Video summary: https://youtu.be/Mqg3aTGNxZ0
Recommended
[Blog] Future ML Systems Will Be Qualitatively Different, Steinhardt (2020)
Evaluating Language-Model Agents on Realistic Autonomous Tasks, Kinniment et al. (2023)
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al. (2023)
Supplementary
Literature review of Transformative Artificial Intelligence timelines, Wynroe et al. (2023)
Are Emergent Abilities of Large Language Models a Mirage? Schaeffer et al. (2023)
Rewards and Goals
Reward Misspecification
[Blog] Specification gaming: the flip side of AI ingenuity, Krakovna et al. (2020)
Deep reinforcement learning from human preferences, Christiano et al. (2017) [OpenAI]
Supplementary
Scaling Laws for Reward Model Overoptimization, Gao et al. (2022) [OpenAI]
Cooperative Inverse Reinforcement Learning, Hadfield-Menell et al. (2016)
Survival Instinct in Offline Reinforcement Learning, Li et al. (2023)
Goal Misgeneralization
Supplementary
Constitutional AI
Constitutional AI: Harmlessness from AI Feedback, Bai et al. (2022) [Anthropic, Askell] (Also a scalable oversight method)
Supplementary
Training language models to follow instructions with human feedback, Ouyang et al. (2022)
A General Language Assistant as a Laboratory for Alignment, Askell et al. (2022) [Anthropic, Askell]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Bai et al. (2022) [Anthropic, Askell]
Measuring Faithfulness in Chain-of-Thought Reasoning, Lanham et al. (2023) [Anthropic]
Understanding and Aligning Ethics
Supplementary
What Would Jiminy Cricket Do? Towards Agents That Behave Morally, Hendrycks et al. (2023) [Steinhardt]
Evaluating the Moral Beliefs Encoded in LLMs, Scherrer et al. (2023)
Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning, Ma et al. (2023)
Adversarial Attacks and Red-Teaming
Universal and Transferable Adversarial Attacks on Aligned Language Models, Zou et al. (2023)
[News Article] Researchers Poke Holes in Safety Controls of ChatGPT and Other Chatbots, Metz (2023)
[Announcement + job posting] Frontier Threats Red Teaming for AI Safety, Anthropic (2023)
Discovering Language Model Behaviors with Model-Written Evaluations, Perez et al. (2022) [Anthropic]
Recommended
Fundamental Limitations of Alignment in Large Language Models, Wolf et al. (2023)
Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks, Casper et al. (2023)
Supplementary
Universal Adversarial Triggers for Attacking and Analyzing NLP, Wallace et al. (2021)
Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Casper et al. (2023)
Red Teaming Language Models with Language Models, Perez et al. (2022) [DeepMind]
FLIRT: Feedback Loop In-context Red Teaming, Mehrabi et al. (2023)
Game-Theoretic Approaches to Alignment
AI safety via debate, Irving et al. (2018) [OpenAI, Christiano]
Measuring Progress on Scalable Oversight for Large Language Models, Perez et al. (2022) [Anthropic]
Supplementary
Interpretability
Discovering Latent Knowledge
Discovering Latent Knowledge in Language Models Without Supervision, Burns et al. (2022) [Steinhardt]
Supplementary
Steering output with interpretability
Steering GPT-2-XL by adding an activation vector, Turner et al. (2023)
Locating and Editing Factual Associations in GPT, Meng et al. (2022)
Supplementary
Interpretability and science of deep learning
Towards Developmental Interpretability, Hoogland et al. (2023)
Studying Large Language Model Generalization with Influence Functions, Grosse et al. (2023)
Mechanistic interpretability
[Informal Note] Interpretability Dreams, Olah (2023) [Anthropic]
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, Wang et al. (2022) [Steinhardt]
Supplementary
Language models can explain neurons in language models, Bills et al. (2023) [OpenAI]
Progress measures for grokking via mechanistic interpretability, Nanda et al. (2023)
Economic Implications of AGI
Economic Growth under Transformative AI, Trammell et al. (2020)
Machines of mind: The case for an AI-powered productivity boom, Baily et al. (2023)
Computer Security Implications of AGI
Criticism of AI Risk and Safety
General
The implausibility of intelligence explosion, Chollet (2017)
Superintelligence: The Idea That Eats Smart People, Cegłowski (idlewords.com)
Don't Fear the Terminator, Zador and LeCun, Scientific American Blog (2019)
[Video] Munk Debate on Artificial Intelligence: Bengio & Tegmark vs. Mitchell & LeCun (YouTube)
From AI ethics
Relevant courses
Resource page
Classical materials
Administrative Information
Lectures: Wed, 13:30-16:20 in Robertson Hall, room 016.
NOTICE: attendance is allowed for registered students only.
Discussion channel: please see the Canvas site for this course, which has an Ed channel.
Requirements: This is an advanced undergraduate/graduate-level seminar that requires independent thinking, reasoning, and presentation ability. Extensive preparation in machine learning / artificial intelligence is expected; COS 324 or equivalent is required.
Attendance and the use of electronic devices: Attendance is expected at all meetings. The use of laptops and similar devices for note-taking is permitted. Food is not allowed, but hot/cold beverages are allowed.
Comic by skeletonclaw, via slides by Devon Wood-Thomas.