Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang
The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose Report Cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate Report Cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating Report Cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that Report Cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
1. We introduce Report Cards, a novel approach to interpretable, qualitative evaluations of LLM behavior. Report Cards address the limitations of purely quantitative metrics and provide richer insights into model performance.
2. We propose a set of metrics to evaluate the specificity, faithfulness, and interpretability of Report Cards, which we use to validate our approach on a variety of LLMs.
3. We present PRESS, an iterative algorithm for generating Report Cards that is competitive with less interpretable baselines and robust to test-time paraphrasing. We investigate factors affecting summary quality through extensive ablation studies.
LLM evaluation balances simplicity and comprehensiveness. Summary statistics enable easy comparisons but lack robustness, while showcasing outputs preserves information but is unwieldy for large datasets.
We propose LLM-generated Report Cards as an automatic, human-interpretable evaluation method, summarizing LLM behaviour for specific skills or topics. Report Cards aim for specificity, faithfulness, and interpretability.
More examples are available at https://www.cs.toronto.edu/~michael/model_comparison.html
A Report Card consists of multiple bullet points, with each point summarizing the model's strengths or weaknesses in a particular area.
Specificity: Describes unique and nuanced features of the model's behaviour.
Faithfulness: Accurately reflects the student model's capability for a skill.
Interpretability: Human-readable, relevant, and informative.
A Report Card should accurately describe unique aspects of a model's behaviour so that it can be used to distinguish between models. We measure the specificity of Report Cards with a contrastive accuracy metric, which assesses how well two student models can be matched to their Report Cards given a quiz of k test questions completed by both students.
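A minimal sketch of this computation, assuming a hypothetical `guess_match` helper that wraps the guesser LLM (the prompt format and presentation-order shuffling are assumptions of the sketch, not details from the paper):

```python
import random

def contrastive_accuracy(card_a, card_b, quizzes_a, quizzes_b, guess_match):
    """Estimate specificity: how often a guesser matches quizzes to the right card.

    card_a, card_b: Report Card text for models A and B.
    quizzes_a, quizzes_b: paired lists of k-question quizzes (question/completion
        pairs) answered by model A and model B respectively.
    guess_match: hypothetical callable wrapping the guesser LLM; given both cards
        and two quizzes, returns True if it attributes the FIRST quiz to card_a.
    """
    correct = 0
    pairs = list(zip(quizzes_a, quizzes_b))
    for quiz_a, quiz_b in pairs:
        # Randomize presentation order so the guesser cannot exploit position.
        if random.random() < 0.5:
            first_is_a = guess_match(card_a, card_b, quiz_a, quiz_b)
            correct += int(first_is_a)
        else:
            first_is_a = guess_match(card_a, card_b, quiz_b, quiz_a)
            correct += int(not first_is_a)
    return correct / len(pairs)
```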
The specific behaviours described by a Report Card, taken as a whole, should accurately capture the model's overall capability with respect to the skill it describes.
To measure faithfulness, we use Elo ratings derived from pairwise comparisons of Report Cards. If the card-based Elo ratings are similar to completion-based ones, then the Report Cards faithfully reflect the models' genuine capabilities. We quantify this agreement by computing the R² between the two sets of Elo ratings.
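A minimal sketch of this comparison, using a standard online Elo update and the squared Pearson correlation for R²; the (winner, loser) outcomes would come from a judge LLM, which is not shown here:

```python
import numpy as np

def elo_ratings(models, match_results, k=32, base=1000.0):
    """Compute Elo ratings from pairwise outcomes.

    match_results: iterable of (winner, loser) model-name pairs, e.g. judge-LLM
    verdicts when comparing either Report Cards or raw completions.
    """
    ratings = {m: base for m in models}
    for winner, loser in match_results:
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return ratings

def faithfulness_r2(card_elo, completion_elo):
    """R² between card-based and completion-based Elo (higher = more faithful)."""
    names = sorted(card_elo)
    x = np.array([card_elo[m] for m in names])
    y = np.array([completion_elo[m] for m in names])
    return float(np.corrcoef(x, y)[0, 1] ** 2)
```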
Report Cards are meant to be read by humans, but it is conceivable that the guesser and judge, being LLMs, could find a human-unreadable Report Card to be both specific and faithful.
Human evaluators assess interpretability by scoring clarity, relevance, and informativeness on a 5-point Likert scale. Volunteers familiar with the subject matter conduct these evaluations.
PRESS iteratively refines Report Cards using an evaluator LLM. It starts with a subset of questions, then progressively incorporates new insights.
Each iteration involves capturing specific performance aspects and merging new information with previous summaries.
PRESS produces more nuanced and comprehensive capability descriptions than one-pass prompting, which may overlook subtle behaviours.
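A minimal sketch of this loop, assuming hypothetical `summarize` and `merge` helpers that wrap the evaluator LLM's progression and merging prompts:

```python
def press(questions_and_completions, summarize, merge, batch_size=8, iterations=5):
    """Sketch of iterative Report Card generation for one skill or topic.

    questions_and_completions: (question, model completion) pairs for the skill.
    summarize: hypothetical evaluator-LLM call that turns a batch of completions
        into bullet-point observations about the model's behaviour.
    merge: hypothetical evaluator-LLM call that folds new observations into the
        existing card, keeping it concise and non-redundant.
    """
    batches = [questions_and_completions[i:i + batch_size]
               for i in range(0, len(questions_and_completions), batch_size)]
    card = summarize(batches[0])                 # initial card from the first batch
    for batch in batches[1:iterations]:          # process up to `iterations` batches
        new_observations = summarize(batch)      # capture aspects seen in this batch
        card = merge(card, new_observations)     # fold them into the running card
    return card
```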
We used two baselines for comparison: a constant predictor that always favors the model that is stronger on the overall dataset, and a few-shot approach that represents each model with a small number of sample completions drawn from its training set rather than a detailed summary.
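For concreteness, one plausible reading of the constant-predictor baseline is sketched below: it always attributes the higher-scoring quiz to the model that is stronger on the overall dataset. The `quiz_score` grader and the exact decision rule are assumptions made for this sketch.

```python
def constant_predictor_accuracy(quiz_pairs, a_is_stronger, quiz_score):
    """Sketch of a constant-predictor baseline for the quiz-matching task.

    quiz_pairs: list of (quiz_from_model_a, quiz_from_model_b) tuples.
    a_is_stronger: whether model A is stronger on the overall dataset
        (supplied by the caller, e.g. from benchmark accuracy).
    quiz_score: hypothetical grader returning a quiz's score, e.g. the
        fraction of its k questions answered correctly.
    """
    correct = 0
    for quiz_a, quiz_b in quiz_pairs:
        # Predict that the higher-scoring quiz came from the stronger model.
        a_scored_higher = quiz_score(quiz_a) >= quiz_score(quiz_b)
        correct += int(a_scored_higher == a_is_stronger)
    return correct / len(quiz_pairs)
```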
PRESS outperforms all baselines on MMLU sub-topics, showcasing its ability to capture model capabilities in reasoning-related domains.
To investigate the impact of stylistic features, completions were "de-stylized" while preserving their content.
With de-stylized completions, PRESS-generated Report Cards show the strongest contrastive specificity, while the few-shot baseline suffers a substantial drop in accuracy.
Report Card Elo consistently achieves the highest faithfulness scores while requiring the fewest comparisons.
We compare faithfulness and specificity of various Report Card generation methods.
PRESS outperforms baselines in nearly all topics. The improvement from the first to the last iteration of PRESS is significant.
Human evaluators consistently rated Report Cards highly (above 4 out of 5).
MMLU subtopics scored slightly lower than Advanced AI Safety Risk topics.
We thank Jessica Bo, Marta Skreta, Ian Berlot-Attwell, and Ramaravind Kommiya Mothilal for helpful discussions and feedback on our work. Farnaz Kohankhaki, John Willes, and others from the Vector Institute AI Engineering team helped develop and contributed to the data collection.