The King is Naked: on the Notion of Robustness for Natural Language Processing
Anthropic Papers on Alignment
Discovering Language Model Behaviors with Model-Written Evaluations
Measuring progress on scalable oversight for large language models
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Scaling Laws and Interpretability of Learning from Repeated Data
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Data and parameter scaling laws for neural machine translation
Evaluating large language models trained on code
Derek Parfit and Development of an Objective Ethics