Artificial Intelligence (AI) and, in particular, machine learning (ML) have emerged as scientific disciplines concerned with understanding and building single- and multi-agent systems able to act and perform as humans do in a variety of contexts [49]. As is true for any scientific discipline, it is critically important to identify and measure scientific progress in AI and ML [40]. However, overall progress in AI and ML is often measured indirectly by evaluating tangible research artifacts, such as models, agents/algorithms, and architectures, on specific tasks (e.g., datasets, benchmarks, or suites) [46]. In particular, the field has designed and constructed thousands of tasks, benchmarks, and datasets for testing wide-ranging capabilities [44]. This proliferation of evaluation tasks has, in turn, given rise to a wide range of evaluation methodologies, often shaped by community-driven dynamics and the particularities of each area [12, 17, 35, 37]. Unsurprisingly, AI and ML evaluation methods and practices have undergone numerous critique-review cycles [7, 15, 25, 28, 34]. Nevertheless, there has been steady progress toward a foundational understanding of evaluation in recent years: techniques from statistics [1, 8], game theory [5, 39, 41], and social choice theory [31, 48] have offered more principled approaches. However, today, with the deployment of increasingly complex models, agents, and systems [4, 19, 43] that tackle ever more challenging tasks [10, 26, 50], there is a growing need to conduct well-grounded and transparent evaluations [6, 9]. Thus, substantial work remains to build the conceptual and methodological foundations needed to accomplish these goals.
This tutorial covers the fundamentals of the AI evaluation problem. In Part I, we thoroughly review existing methodologies, including statistics, probabilistic choice models, game theory, social choice theory, and graph theory. Part II then presents a unifying decision-theoretic perspective on the problem, reviews common pitfalls originating from unprincipled applications of the methodologies introduced in Part I, and offers principled recipes to avoid these issues in practice. The learning outcomes of this tutorial include 1) an understanding of some of the challenges and pitfalls that arise in the evaluation of AI systems, 2) an introduction to methodologies for the evaluation problem, and 3) the pros and cons of each methodology, including insights into when and how to apply them.
Duration: 3 hr
Content: Part I of the tutorial covers the problem of AI evaluation through a detailed review of existing methodologies, including statistics, probabilistic choice models, game theory, social choice theory, and network theory.
Introduction & Fundamental Assumptions
Statistics Selection & Practical Limitations (sketch below)
Refs: [1, 24, 29, 30, 53]
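To make the statistical-reporting discussion concrete, here is a minimal sketch (in Python, on hypothetical per-run scores) of choosing a robust aggregate statistic and attaching bootstrap uncertainty to it, in the spirit of the recommendations in [1]; the interquartile mean and the 95% percentile interval are illustrative choices, not prescriptions of the tutorial.

import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)
scores = rng.normal(0.7, 0.15, size=20)   # hypothetical per-run scores of one algorithm

def iqm(x):
    # Interquartile mean: average of the middle 50% of runs, robust to outlier runs.
    return trim_mean(x, proportiontocut=0.25)

# Percentile bootstrap confidence interval for the chosen statistic.
boot = np.array([iqm(rng.choice(scores, size=scores.size, replace=True))
                 for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"IQM = {iqm(scores):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")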
Introduction: Bradley-Terry & Plackett-Luce Models
Elo: Foundations, Properties, and Pitfalls (sketch below)
Refs: [11, 27, 51, 52]
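As a companion to this block, a minimal sketch of an online Elo update over pairwise outcomes; the K-factor of 32 and the 400-point logistic scale are the conventional defaults rather than values taken from the tutorial, and the expected-score formula is exactly a Bradley-Terry win probability in base-10 form.

def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update; score_a is 1 if A wins, 0.5 for a draw, 0 if A loses."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two agents start at 1200; A beats B twice, then loses once.
ra, rb = 1200.0, 1200.0
for outcome in (1, 1, 0):
    ra, rb = elo_update(ra, rb, outcome)
print(round(ra, 1), round(rb, 1))   # note: the result depends on game order, one of Elo's pitfalls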
Introduction: Game-Theoretic Evaluation
Nash Averaging (sketch below)
General-Sum Equilibria & N-Player Rating
Evolutionary Dynamics for Evaluation
Refs: [5, 38, 41]
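As a rough illustration of the game-theoretic viewpoint, the sketch below rates agents on a toy antisymmetric win-margin matrix by their payoff against an equilibrium mixture. Note that Nash averaging [5] specifically uses the maximum-entropy Nash equilibrium; the linear program here returns just one equilibrium of the zero-sum game, so it is a simplified stand-in.

import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(A):
    # One maximin (Nash) strategy for the row player of the zero-sum game with payoff A.
    n = A.shape[0]
    c = np.concatenate([np.zeros(n), [-1.0]])            # variables [p, v]; maximize v
    A_ub = np.hstack([-A.T, np.ones((A.shape[1], 1))])   # v - (A^T p)_j <= 0 for every column j
    b_ub = np.zeros(A.shape[1])
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
    b_eq = [1.0]                                          # mixture weights sum to one
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Antisymmetric win-rate margins among three agents (a rock-paper-scissors-like cycle).
A = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
p = zero_sum_nash(A)
ratings = A @ p          # payoff of each agent against the equilibrium mixture
print(p.round(3), ratings.round(3))   # cyclic agents end up with equal (zero) ratings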
Introduction: Evaluating Agents using Social Choice Theory: Voting-as-Evaluation (VasE) and Vote’n’Rank (sketch below)
Probabilistic Social Choice
Social Choice Ranking as Optimization
Refs: [31, 32, 48]
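To ground the voting-as-evaluation idea, a small sketch that treats each task as a voter ranking the agents and aggregates with a Borda count; the rules actually studied in VasE [31] and Vote’n’Rank [48] differ, so this is illustrative only.

from collections import defaultdict

# Hypothetical per-task rankings (best to worst); each task acts as one voter.
ballots = [
    ["agent_a", "agent_b", "agent_c"],
    ["agent_b", "agent_a", "agent_c"],
    ["agent_c", "agent_a", "agent_b"],
]

def borda(ballots):
    # Borda count: a candidate ranked r-th among m candidates scores m - 1 - r points.
    scores = defaultdict(float)
    for ranking in ballots:
        m = len(ranking)
        for r, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - r
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(borda(ballots))   # [('agent_a', 4.0), ('agent_b', 3.0), ('agent_c', 2.0)]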
Introduction: Comparison Graphs
Laplacian Null-Space & Markov Chain Rating (sketch below)
Refs: [13]
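A compact sketch of a Markov-chain rating over a comparison graph, in the spirit of the diffusion family surveyed in [13]: each agent passes probability mass toward agents that beat it, and the stationary distribution of the resulting chain serves as the rating. The damping term is an added assumption that keeps the chain irreducible on sparse graphs.

import numpy as np

# wins[i, j] = number of times agent i beat agent j (a toy comparison graph).
wins = np.array([[0, 3, 1],
                 [1, 0, 4],
                 [3, 0, 0]], dtype=float)

losses = wins.T                                  # mass flows from losers toward winners
P = losses / losses.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
n = len(P)
P = 0.9 * P + 0.1 / n                            # damping keeps the chain irreducible

# Stationary distribution = left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
stationary = np.real(vecs[:, np.argmax(np.real(vals))])
stationary = np.abs(stationary) / np.abs(stationary).sum()
print(stationary.round(3))                       # more stationary mass = stronger agent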
Duration: 1 hr
Content: Part II presents a unifying perspective of the problem by framing AI evaluation as designers’ decision-making. It defines the structure of AI evaluation and its three fundamental axes, reviews common pitfalls originating from unprincipled applications of different methodologies introduced in Part I, and offers decision-theoretic recipes to avoid these issues.
The Task-Artifact-Context Structure
Metrics as Designer’s Utility (see the sketch after this outline)
Principled Reductions
Motivation
Common Pitfalls in Practice
Recipes
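Purely as an illustration of the decision-theoretic framing (the concrete structure and recipes are developed in the tutorial itself and are not reproduced here), the following toy sketch shows how the same per-task scores select different "best" artifacts once the designer's utility weights over tasks change.

import numpy as np

# Hypothetical scores of three artifacts (rows) on two tasks (columns).
scores = np.array([[0.90, 0.40],
                   [0.60, 0.70],
                   [0.75, 0.55]])

def best_artifact(task_weights):
    # Pick the artifact maximizing the designer's expected utility over tasks.
    utility = scores @ np.asarray(task_weights)
    return int(np.argmax(utility)), utility

for w in ([0.9, 0.1], [0.2, 0.8]):
    idx, u = best_artifact(w)
    print(f"weights={w} -> best artifact {idx}, utilities {u.round(3)}")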
University of Montreal
Google DeepMind
Google DeepMind
[1] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. 2021. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems 34 (2021).
[2] Stefano V. Albrecht and Subramanian Ramamoorthy. 2012. Comparative evaluation of MAL algorithms in a diverse set of ad hoc team problems. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Valencia, Spain) (AAMAS ’12). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 349–356.
[3] Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard Everett, Satinder Singh, Thore Graepel, and Yoram Bachrach. 2020. Learning to play no-press diplomacy with best response policy iteration. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS’20). Curran Associates Inc., 17987–18003.
[4] Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. Technical Report.
[5] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. 2018. Re-evaluating evaluation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 3272–3283.
[6] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
[7] Ryan Burnell, Wout Schellaert, John Burden, Tomer D Ullman, Fernando Martinez-Plumed, Joshua B Tenenbaum, Danaja Rutar, Lucy G Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M Voorhees, Anthony G Cohn, Joel Z Leibo, and Jose Hernandez-Orallo. 2023. Rethink reporting of evaluation results in AI. Science 380, 6641 (April 2023), 136–138.
[8] Stephanie C Y Chan, Samuel Fishman, Anoop Korattikara, John Canny, and Sergio Guadarrama. 2020. Measuring the Reliability of Reinforcement Learning Algorithms. In International Conference on Learning Representations.
[9] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. (March 2024). arXiv:2403.04132 [cs.AI]
[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. (Oct. 2021). arXiv:2110.14168 [cs.LG]
[11] Rémi Coulom. 2008. Whole-history rating: A Bayesian rating system for players of time-varying strength. In Computers and Games. Springer Berlin Heidelberg, Berlin, Heidelberg, 113–124.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[13] Stephen Devlin and Thomas Treloar. 2018. A network diffusion ranking family that includes the methods of Markov, Massey, and Colley. Journal of Quantitative Analysis in Sports 14, 3 (Sept. 2018), 91–101.
[14] Manfred Diaz and Aurélien Bück-Kaeffer. 2023. PopRank: A Rating Library for Population-based Training. https://github.com/poprl/poprank.
[15] Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the Eye of the User: A Critique of NLP Leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4846–4853.
[16] Richard Everett, Adam Cobb, Andrew Markham, and Stephen Roberts. 2019. Optimising Worlds to Evaluate and Influence Reinforcement Learning Agents. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS ’19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1943–1945.
[17] M Everingham, L Van Gool, C K I Williams, J Winn, and A Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (June 2010), 303–338.
[18] Emilia Garcia, Adriana Giret, and Vicente Botti. 2010. An evaluation tool for multiagent development techniques. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1 - Volume 1 (Toronto, Canada) (AAMAS ’10). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1625–1626.
[19] Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models. (Dec. 2023). arXiv:2312.11805 [cs.CL]
[20] Ian Gemp, Thomas Anthony, Yoram Bachrach, Avishkar Bhoopchand, Kalesha Bullard, Jerome Connor, Vibhavari Dasagi, Bart De Vylder, Edgar A Duéñez-Guzmán, Romuald Elie, Richard Everett, Daniel Hennes, Edward Hughes, Mina Khan, Marc Lanctot, Kate Larson, Guy Lever, Siqi Liu, Luke Marris, Kevin R. McKee, Paul Muller, Julien Pérolat, Florian Strub, Andrea Tacchetti, Eugene Tarassov, Zhe Wang, and Karl Tuyls. 2022. Developing, evaluating and scaling learning agents in multi-agent environments. AI Communications 35, 4 (2022), 271–284.
[21] Ian Gemp, Marc Lanctot, Luke Marris, Yiran Mao, Edgar Duéñez-Guzmán, Sarah Perrin, Andras Gyorgy, Romuald Elie, Georgios Piliouras, Michael Kaisers, Daniel Hennes, Kalesha Bullard, Kate Larson, and Yoram Bachrach. 2024. Approximating the Core via Iterative Coalition Sampling. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems. 669–678.
[22] Ian Gemp, Rahul Savani, Marc Lanctot, Yoram Bachrach, Thomas Anthony, Richard Everett, Andrea Tacchetti, Tom Eccles, and János Kramár. 2022. Sample-based Approximation of Nash in Large Many-Player Games via Gradient Descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 507–515.
[23] Aditya Grover, Maruan Al-Shedivat, Jayesh K. Gupta, Yuri Burda, and Harrison Edwards. 2018. Evaluating Generalization in Multiagent Systems using Agent-Interaction Graphs. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (Stockholm, Sweden) (AAMAS’18). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1944–1946.
[24] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
[25] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. 2018. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence (New Orleans, Louisiana, USA) (AAAI’18/IAAI’18/EAAI’18, Article 392). AAAI Press, 3207–3214.
[26] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. (Sept. 2020). arXiv:2009.03300 [cs.CY]
[27] David R Hunter. 2004. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics 32, 1 (Feb. 2004), 384–406.
[28] Nathalie Japkowicz. 2006. Why Question Machine Learning Evaluation Methods? (An illustrative review of the shortcomings of current methods). In AAAI Workshop Papers.
[29] Scott Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip Thomas. 2020. Evaluating the performance of reinforcement learning algorithms. In International Conference on Machine Learning. PMLR, 4962–4973.
[30] Scott M Jordan, Adam White, Bruno Castro Da Silva, Martha White, and Philip S Thomas. 2024. Position: Benchmarking is limited in reinforcement learning research. arXiv preprint arXiv:2406.16241 (2024).
[31] Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, and Anna Koop. 2023. Evaluating Agents using Social Choice Theory. arXiv:2312.03121 [cs.AI]
[32] Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, and Doina Precup. 2024. Soft Condorcet Optimization for Ranking of General Agents. arXiv:2411.00119 [cs.MA]
[33] Marc Lanctot, John Schultz, Neil Burch, Max Olan Smith, Daniel Hennes, Thomas Anthony, and Julien Perolat. 2023. Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning. Transactions on Machine Learning Research (2023).
[34] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. 2021. Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: Common Objects in Context. (May 2014). arXiv:1405.0312 [cs.CV]
[36] Emiliano Lorini. 2021. A Logic of Evaluation. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’21). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 827–835.
[37] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. 2018. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research 61 (March 2018), 523–562.
[38] Luke Marris, Marc Lanctot, Ian Gemp, Shayegan Omidshafiei, Stephen McAleer, Jerome Connor, Karl Tuyls, and Thore Graepel. 2022. Game Theoretic Rating in N-player general-sum games with Equilibria. arXiv:2210.02205 [cs.GT]
[39] Luke Marris, Marc Lanctot, Ian Gemp, Shayegan Omidshafiei, Stephen McAleer, Jerome Connor, Karl Tuyls, and Thore Graepel. 2022. Game Theoretic Rating in N-player general-sum games with Equilibria. arXiv:2210.02205 [cs.GT]
[40] Ilkka Niiniluoto. 2024. Scientific Progress. In The Stanford Encyclopedia of Philosophy (Spring 2024 ed.). Metaphysics Research Lab, Stanford University.
[41] Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos. 2019. α-Rank: Multi-Agent Evaluation by Evolution. Scientific Reports 9, 1 (July 2019), 9937.
[43] OpenAI. 2023. GPT-4 Technical Report. (March 2023). arXiv:2303.08774 [cs.CL]
[44] Papers with Code. 2024. Papers with Code - Machine Learning. https://paperswithcode.com/. Accessed: 2024-6-21.
[45] Roma Patel, Marta Garnelo, Ian Gemp, Chris Dyer, and Yoram Bachrach. 2021. Game-theoretic vocabulary selection via the shapley value and banzhaf index. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2789–2798.
[46] Inioluwa Deborah Raji, Emily Denton, Emily M Bender, Alex Hanna, and Amandalynne Paullada. 2021. AI and the Everything in the Whole Wide World Benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J Vanschoren and S Yeung (Eds.), Vol. 1.
[47] Matthias Rehm and Peter Rosina. 2008. SecondLife® as an evaluation platform for multiagent systems featuring social interactions. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems: Demo Papers (Estoril, Portugal) (AAMAS ’08). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1663–1664.
[48] Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey Kravchenko, Elena Tutubalina, Tatiana Shavrina, Daniel Karabekyan, and Ekaterina Artemova. 2022. Vote’n’Rank: Revision of Benchmarking with Social Choice Theory. (Oct. 2022). arXiv:2210.05769 [cs.LG]
[49] Stuart Russell and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. Pearson.
[50] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. (July 2019). arXiv:1907.10641 [cs.CL]
[51] Nihar B Shah, Sivaraman Balakrishnan, Adityanand Guntuboyina, and Martin J Wainwright. 2015. Stochastically Transitive Models for Pairwise Comparisons: Statistical and Computational Issues. arXiv [stat.ML] (Oct. 2015).
[52] Nihar B Shah and Martin J Wainwright. 2018. Simple, Robust and Optimal Ranking from Pairwise Comparisons. Journal of Machine Learning Research 18, 199 (2018), 1–38.
[53] Lirong Xia. 2019. Learning and Decision-Making from Rank Data. Springer.