IFT6760A Winter 2023

Course Description
Projects
Invited Talks
BeyondScaling
Schedule

AI Alignment and Safety

Unsolved Problems in ML Safety
Concrete Problems in AI Safety
A General Language Assistant as a Laboratory for Alignment
Aligning Language Models to Follow Instructions
Alignment of Language Agents
Alignment Newsletter
The King is Naked: on the Notion of Robustness for Natural Language Processing

Anthropic Papers on Alignment

Discovering Language Model Behaviors with Model-Written Evaluations
Constitutional AI: Harmlessness from AI Feedback
Measuring progress on scalable oversight for large language models
In-context learning and induction heads
Toy Models of Superposition
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
Language models (mostly) know what they know
Predictability and surprise in large generative models
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Scaling Laws and Interpretability of Learning from Repeated Data
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
A general language assistant as a laboratory for alignment
Data and parameter scaling laws for neural machine translation
Evaluating large language models trained on code

Derek Parfit and the Development of an Objective Ethics

Why Anything? Why This?
Reasons and Persons
On What Matters (vol 1, vol 2, vol 3)
