Courses, Surveys, and Books
Introduction to Machine Learning Safety (a course by Dan Hendrycks)
The King is Naked: on the Notion of Robustness for Natural Language Processing
Anthropic's Papers on Alignment
Evaluating and Mitigating Discrimination in Language Model Decisions
Specific versus General Principles for Constitutional AI
Towards Understanding Sycophancy in Language Models
Collective Constitutional AI: Aligning a Language Model with Public Input
Decomposing Language Models Into Understandable Components
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Challenges in evaluating AI systems
Tracing Model Outputs to the Training Data
Studying Large Language Model Generalization with Influence Functions
Measuring Faithfulness in Chain-of-Thought Reasoning
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Distributed Representations: Composition & Superposition
The Capacity for Moral Self-Correction in Large Language Models
Superposition, Memorization, and Double Descent
Discovering Language Model Behaviors with Model-Written Evaluations
Measuring progress on scalable oversight for large language models
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Scaling Laws and Interpretability of Learning from Repeated Data
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Data and parameter scaling laws for neural machine translation
Evaluating large language models trained on code
Derek Parfit and Development of an Objective Ethics