Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang
The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose Report Cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate Report Cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating Report Cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that Report Cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
1. We introduce Report Cards, a novel approach to interpretable, qualitative evaluations of LLM behavior. Report Cards address the limitations of purely quantitative metrics and provide richer insights into model performance.
2. We propose a set of metrics to evaluate the specificity, faithfulness, and interpretability of Report Cards, which we use to validate our approach on a variety of LLMs.
3. We present PRESS, an iterative algorithm for generating Report Cards that is competitive with less interpretable baselines and robust to test-time paraphrasing. We investigate factors affecting summary quality through extensive ablation studies.
LLM evaluation balances simplicity and comprehensiveness. Summary statistics enable easy comparisons but lack robustness, while showcasing outputs preserves information but is unwieldy for large datasets.
We propose LLM-generated Report Cards as an automatic, human-interpretable evaluation method, summarizing LLM behaviour for specific skills or topics. Report Cards aim for specificity, faithfulness, and interpretability.
More examples are available at https://www.cs.toronto.edu/~michael/model_comparison.html
A Report Card consists of multiple bullet points, with each point summarizing the model's strengths or weaknesses in a particular area.
Specificity: Describes unique and nuanced features of the model's behaviour.
Faithfulness: Accurately reflects the student model's capability for a skill.
Interpretability: Human-readable, relevant, and informative.
A Report Card should accurately describe unique aspects of a model's behaviour so that it can be used to distinguish between models. We measure the specificity of Report Cards with a contrastive accuracy metric, which assesses how well two student models can be matched to their Report Cards given a quiz of k test questions completed by both students.
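A minimal sketch of this computation, assuming a hypothetical `guess_match` helper that wraps the guesser LLM (the prompt format and presentation-order shuffling are assumptions of the sketch, not details from the paper):

```python
import random

def contrastive_accuracy(card_a, card_b, quizzes_a, quizzes_b, guess_match):
    """Estimate specificity: how often a guesser matches quizzes to the right card.

    card_a, card_b: Report Card text for models A and B.
    quizzes_a, quizzes_b: paired lists of k-question quizzes (question/completion
        pairs) answered by model A and model B respectively.
    guess_match: hypothetical callable wrapping the guesser LLM; given both cards
        and two quizzes, returns True if it attributes the FIRST quiz to card_a.
    """
    correct = 0
    pairs = list(zip(quizzes_a, quizzes_b))
    for quiz_a, quiz_b in pairs:
        # Randomize presentation order so the guesser cannot exploit position.
        if random.random() < 0.5:
            first_is_a = guess_match(card_a, card_b, quiz_a, quiz_b)
            correct += int(first_is_a)
        else:
            first_is_a = guess_match(card_a, card_b, quiz_b, quiz_a)
            correct += int(not first_is_a)
    return correct / len(pairs)
```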
The specific behaviours described by a Report Card, taken as a whole, should accurately capture the model's overall capability with respect to the skill it describes.
To measure faithfulness, we use Elo ratings derived from pairwise comparisons of Report Cards. If the card-based Elo ratings are similar to completion-based ones, then the Report Cards faithfully reflect the models' genuine capabilities. We quantify this agreement by computing the R² between the two sets of Elo ratings.
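A minimal sketch of this comparison, using a standard online Elo update and the squared Pearson correlation for R²; the (winner, loser) outcomes would come from a judge LLM, which is not shown here:

```python
import numpy as np

def elo_ratings(models, match_results, k=32, base=1000.0):
    """Compute Elo ratings from pairwise outcomes.

    match_results: iterable of (winner, loser) model-name pairs, e.g. judge-LLM
    verdicts when comparing either Report Cards or raw completions.
    """
    ratings = {m: base for m in models}
    for winner, loser in match_results:
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return ratings

def faithfulness_r2(card_elo, completion_elo):
    """R² between card-based and completion-based Elo (higher = more faithful)."""
    names = sorted(card_elo)
    x = np.array([card_elo[m] for m in names])
    y = np.array([completion_elo[m] for m in names])
    return float(np.corrcoef(x, y)[0, 1] ** 2)
```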
Report Cards are meant to be read by humans, but it is conceivable that the guesser and judge, being LLMs, could find a human-unreadable Report Card to be both specific and faithful.
Human evaluators assess interpretability by scoring clarity, relevance, and informativeness on a 5-point Likert scale. Volunteers familiar with the subject matter conduct these evaluations.
PRESS iteratively refines Report Cards using an evaluator LLM. It starts with a subset of questions, then progressively incorporates new insights.
Each iteration involves capturing specific performance aspects and merging new information with previous summaries.
PRESS produces more nuanced and comprehensive capability descriptions than one-pass prompting, which may overlook subtle behaviours.
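A minimal sketch of this loop, assuming hypothetical `summarize` and `merge` helpers that wrap the evaluator LLM's progression and merging prompts:

```python
def press(questions_and_completions, summarize, merge, batch_size=8, iterations=5):
    """Sketch of iterative Report Card generation for one skill or topic.

    questions_and_completions: (question, model completion) pairs for the skill.
    summarize: hypothetical evaluator-LLM call that turns a batch of completions
        into bullet-point observations about the model's behaviour.
    merge: hypothetical evaluator-LLM call that folds new observations into the
        existing card, keeping it concise and non-redundant.
    """
    batches = [questions_and_completions[i:i + batch_size]
               for i in range(0, len(questions_and_completions), batch_size)]
    card = summarize(batches[0])                 # initial card from the first batch
    for batch in batches[1:iterations]:          # process up to `iterations` batches
        new_observations = summarize(batch)      # capture aspects seen in this batch
        card = merge(card, new_observations)     # fold them into the running card
    return card
```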
We used two baselines for comparison: a constant predictor that always favors the model that is stronger on the overall dataset, and a few-shot approach that represents each model with a small number of sample completions drawn from its training set rather than a detailed summary.
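For concreteness, one plausible reading of the constant-predictor baseline is sketched below: it always attributes the higher-scoring quiz to the model that is stronger on the overall dataset. The `quiz_score` grader and the exact decision rule are assumptions made for this sketch.

```python
def constant_predictor_accuracy(quiz_pairs, a_is_stronger, quiz_score):
    """Sketch of a constant-predictor baseline for the quiz-matching task.

    quiz_pairs: list of (quiz_from_model_a, quiz_from_model_b) tuples.
    a_is_stronger: whether model A is stronger on the overall dataset
        (supplied by the caller, e.g. from benchmark accuracy).
    quiz_score: hypothetical grader returning a quiz's score, e.g. the
        fraction of its k questions answered correctly.
    """
    correct = 0
    for quiz_a, quiz_b in quiz_pairs:
        # Predict that the higher-scoring quiz came from the stronger model.
        a_scored_higher = quiz_score(quiz_a) >= quiz_score(quiz_b)
        correct += int(a_scored_higher == a_is_stronger)
    return correct / len(quiz_pairs)
```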
PRESS outperforms all baselines on MMLU sub-topics, showcasing its ability to capture model capabilities in reasoning-related domains.
To investigate the impact of stylistic features, completions were "de-stylized" while preserving their content.
With de-stylized completions, PRESS-generated Report Cards show the strongest contrastive specificity, while the few-shot baseline suffers a substantial drop in accuracy.
Report Card Elo consistently achieves the highest faithfulness scores while requiring the fewest comparisons.
We compare faithfulness and specificity of various Report Card generation methods.
PRESS outperforms baselines in nearly all topics. The improvement from the first to the last iteration of PRESS is significant.
Human evaluators consistently rated Report Cards highly (above 4 out of 5).
MMLU subtopics scored slightly lower than Advanced AI Safety Risk topics.
We thank Jessica Bo, Marta Skreta, Ian Berlot-Attwell, and Ramaravind Kommiya Mothilal for helpful discussions and feedback on our work. Farnaz Kohankhaki, John Willes, and others from the Vector Institute AI Engineering team helped develop and contributed to the data collection.