Anthropic's Papers on Alignment

Evaluating and Mitigating Discrimination in Language Model Decisions

Specific versus General Principles for Constitutional AI

Towards Understanding Sycophancy in Language Models

Collective Constitutional AI: Aligning a Language Model with Public Input

Decomposing Language Models Into Understandable Components

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Challenges in evaluating AI systems

Tracing Model Outputs to the Training Data

Studying Large Language Model Generalization with Influence Functions

Measuring Faithfulness in Chain-of-Thought Reasoning

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Circuits Updates — May 2023

Interpretability Dreams

Distributed Representations: Composition & Superposition

The Capacity for Moral Self-Correction in Large Language Models

Superposition, Memorization, and Double Descent