AI Safety & Alignment 

Course Staff

Instructor:
Prof. Elad Hazan, CS building 409, Office hours:  after class, or by appointment.

Teaching Assistants:  

Xinyi Chen, CS building 421, Office hours:  after class, or by appointment.

Jennifer Sun, CS building 421, Office hours:  after class, or by appointment.

Undergrad TA:

Waree Sethapun

Course Description & Basic Information 

What existential and social risks do modern AI systems and methods pose? How can we mitigate these risks and ensure that AI systems are safe and aligned with the intentions of their builders? What is currently being done to ensure that AI systems are safe? Are current safeguards sufficient, and if not, how can we improve upon them?
These are the questions we will consider in this advanced seminar. Diverse topics from a variety of disciplines will be considered, including algorithms and optimization-based methods for alignment, the mind-body problem as applied to AI, accountability and the free choice problem, the forecast economic impact of AI breakthroughs, and more.

The topics in this course will not include AI fairness, as this important topic is covered in other courses. 

The course will present talks by leading thinkers on the topic of AI alignment and safety, as well as student presentations on selected readings chosen by the instructors. 

This is an advanced graduate-level course, open to all graduate and undergraduate students, although extensive preparation in machine learning / artificial intelligence is expected (COS 324 or equivalent is required).

Reading Materials and Related Courses 

(Note: this literature does not include fairness in AI and ethics as it relates to fairness with respect to different populations; these are very important topics that are the main subject of other courses at Princeton.)

Contents

Introduction 

Supplementary



Capabilities and Scaling 


Recommended

Supplementary



Reward and Goals

Reward Misspecification 

Supplementary


Goal Misgeneralization

Supplementary


Constitutional AI

Supplementary

Measuring Faithfulness in Chain-of-Thought Reasoning, Lanham et al. (2023) [Anthropic]

Understanding and Aligning Ethics

Supplementary

Adversarial attacks and red-teaming 

Recommended

Supplementary

Game theoretic approaches

Supplementary

Interpretability 


Discovering Latent Knowledge

Supplementary


Steering output with interpretability

Supplementary 


Interpretability and science of deep learning


Mechanistic interpretability 

Supplementary

Economic Implications of AGI

Computer security implications of AGI

Criticism on AI risk and safety

General 

From AI ethics 


Relevant courses

Resource page

Classical materials


Administrative Information

Lectures: Wed, 13:30-16:20 in Robertson Hall, room 016.
NOTICE: attendance is limited to registered students only.

Discussion channel: please see the canvas system for this course, which has an Ed channel.  

Requirements: This is an advanced undergraduate/graduate-level seminar that requires independent thinking, reasoning, and presentation ability. Extensive preparation in machine learning / artificial intelligence is expected; COS 324 or equivalent is required.

Attendance and the use of electronic devices: Attendance is expected at all meetings. The use of laptops and similar devices for note-taking is permitted. Food is not allowed, but hot and cold beverages are.

Comic by skeletonclaw, seen in slides by Devon Wood-Thomas.