Towards AGI

Alignment and Safety

Courses, Surveys and Books

  • Introduction to Machine Learning Safety (a course by Dan Hendrycks)

  • An Introduction to AI Safety, Ethics, and Society (book)

  • AI Alignment: A Comprehensive Survey (2024)

  • Unsolved Problems in ML Safety

  • Concrete Problems in AI Safety

  • A General Language Assistant as a Laboratory for Alignment

  • Aligning Language Models to Follow Instructions

  • Alignment of Language Agents

  • Alignment Newsletter

  • The King is Naked: on the Notion of Robustness for Natural Language Processing

Anthropic's Papers on Alignment

  • Evaluating and Mitigating Discrimination in Language Model Decisions

  • Specific versus General Principles for Constitutional AI

  • Towards Understanding Sycophancy in Language Models

  • Collective Constitutional AI: Aligning a Language Model with Public Input

  • Decomposing Language Models Into Understandable Components

  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

  • Challenges in evaluating AI systems

  • Tracing Model Outputs to the Training Data

  • Studying Large Language Model Generalization with Influence Functions

  • Measuring Faithfulness in Chain-of-Thought Reasoning

  • Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

  • Towards Measuring the Representation of Subjective Global Opinions in Language Models

  • Circuits Updates — May 2023

  • Interpretability Dreams

  • Distributed Representations: Composition & Superposition

  • Privileged Bases in the Transformer Residual Stream

  • The Capacity for Moral Self-Correction in Large Language Models

  • Superposition, Memorization, and Double Descent

  • Discovering Language Model Behaviors with Model-Written Evaluations

  • Constitutional AI: Harmlessness from AI Feedback

  • Measuring Progress on Scalable Oversight for Large Language Models

  • In-context Learning and Induction Heads

  • Toy Models of Superposition

  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

  • Language Models (Mostly) Know What They Know

  • Predictability and Surprise in Large Generative Models

  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Scaling Laws and Interpretability of Learning from Repeated Data

  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

  • A General Language Assistant as a Laboratory for Alignment

  • Data and Parameter Scaling Laws for Neural Machine Translation

  • Evaluating Large Language Models Trained on Code

Derek Parfit and the Development of an Objective Ethics

  • Why Anything? Why This?

  • Reasons and Persons

  • On What Matters (vol 1, vol 2, vol 3)
