From Benchmarking to Trustworthy AI: Rethinking Evaluation Methods Across Vision and Complex Systems
Tutorial in the 28th European Conference on Artificial Intelligence (ECAI)
As AI systems are deployed in increasingly complex and uncertain environments, traditional evaluation metrics such as accuracy or reward are no longer sufficient to measure AI’s true capabilities. This tutorial explores the evolving landscape of trustworthy AI evaluation, moving beyond task-specific benchmarks to assess reasoning ability, adaptability, and robustness in dynamic, real-world settings.
We will begin by reviewing the historical development of AI evaluation, highlighting the limitations of static benchmarks in computer vision and reinforcement learning. We then introduce recent advances in capability-based, interactive, and multimodal evaluation methods, such as the Visual Turing Test, event-level analysis, and cross-modal alignment techniques. The first half of the tutorial focuses on computer vision tasks, including video reasoning, vision-language grounding, and multimodal robustness. The second half shifts to complex systems, covering long-term strategy assessment in multi-agent settings, network-based evaluation, and real-world applications such as epidemic modeling and intelligent traffic control.
By bridging theory, methods, and applications, this tutorial aims to provide a unified perspective on how to evaluate AI in the wild, equipping participants with practical tools and conceptual frameworks to support trustworthy AI development across disciplines.
The organization of the tutorial is as follows:
This opening section provides an overview of how AI evaluation has evolved from static, task-specific metrics (e.g., accuracy, precision, reward) to more dynamic and holistic approaches. It highlights critical limitations of traditional benchmarks—such as their inability to measure reasoning, generalization, and long-term robustness—and introduces the tutorial’s structure, which spans vision-based tasks, complex systems, and future evaluation paradigms.
This module explores the challenges and recent advances in evaluating AI within visual tasks. It covers standard vision metrics, limitations in long video understanding, and emerging interactive and multimodal evaluation techniques. Participants will learn about new paradigms such as Visual Turing Tests and event-level analysis that better reflect AI’s spatial-temporal reasoning and multimodal understanding.
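To make the contrast with frame-level metrics concrete, here is a minimal sketch of event-level scoring in Python. It is an illustration under our own assumptions (interval-based events, a 0.5 temporal-IoU match threshold, made-up timestamps), not the tutorial's reference implementation:

```python
# Minimal sketch (illustrative assumptions, not the tutorial's code): event-level
# scoring for long-video understanding, where predictions and ground truth are
# time intervals rather than per-frame labels.

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def event_level_recall(preds, gts, threshold=0.5):
    """Fraction of ground-truth events matched by some prediction at IoU >= threshold."""
    matched = sum(
        any(temporal_iou(p, g) >= threshold for p in preds) for g in gts
    )
    return matched / len(gts) if gts else 0.0

# Example: two predicted events scored against three annotated events.
preds = [(2.0, 7.5), (10.0, 14.0)]
gts = [(2.5, 8.0), (9.5, 13.0), (20.0, 25.0)]
print(event_level_recall(preds, gts))  # 2 of 3 events matched -> ~0.667
```

A model can have high frame-level accuracy yet miss entire events; scoring at the event level exposes exactly that gap.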
This section delves into AI evaluation beyond perception tasks, focusing on decision-making and strategic adaptation in complex, dynamic environments. It introduces system-level evaluation frameworks in reinforcement learning and multi-agent settings, as well as network-based approaches for modeling social influence, epidemic control, and traffic systems. The goal is to assess AI’s stability, adaptability, and long-term reasoning under uncertainty.
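As a flavor of network-based, system-level evaluation, the sketch below runs a toy discrete-time SIR epidemic on a random graph and reports the final attack rate, a system-level outcome rather than a per-prediction score. All parameters (graph size, edge probability, infection and recovery rates) are illustrative assumptions, not values from the tutorial:

```python
# Minimal sketch, assuming a toy setting: a discrete-time SIR process on a
# random graph, of the kind used to stress-test control policies (e.g.,
# quarantining high-degree nodes) at the system level.
import random

random.seed(0)
N, EDGE_P, BETA, GAMMA = 200, 0.03, 0.2, 0.1  # illustrative parameters

# Erdos-Renyi-style adjacency list.
adj = {i: set() for i in range(N)}
for i in range(N):
    for j in range(i + 1, N):
        if random.random() < EDGE_P:
            adj[i].add(j)
            adj[j].add(i)

state = {i: "S" for i in range(N)}
state[0] = "I"  # single seed infection

for step in range(100):
    infected = [i for i, s in state.items() if s == "I"]
    if not infected:
        break
    for i in infected:
        for j in adj[i]:
            if state[j] == "S" and random.random() < BETA:
                state[j] = "I"  # transmission along an edge
        if random.random() < GAMMA:
            state[i] = "R"      # recovery

attack_rate = sum(s != "S" for s in state.values()) / N
print(f"final attack rate: {attack_rate:.2f}")  # system-level outcome metric
```

The design point: an intervention policy is judged by how it shifts this aggregate trajectory over time, not by the accuracy of any single prediction.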
The final section looks forward to emerging trends in trustworthy AI evaluation. It emphasizes the importance of capability-driven metrics, interactive assessment techniques that reflect real-world dynamics, and standardized multimodal evaluation protocols. Novel theoretical tools—such as Parrondo’s Paradox and graph-based optimization—are discussed to inspire new perspectives on how to evaluate intelligent systems in open and evolving environments.
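For readers new to Parrondo's Paradox, the sketch below reproduces the classic coin-game formulation (capital-dependent Game B with modulus 3 and the textbook bias epsilon = 0.005); it is standard illustrative material, not code from the tutorial. Games A and B each lose when played alone, yet randomly alternating between them yields a positive drift:

```python
# Minimal sketch of Parrondo's paradox in its classic coin-game form
# (illustrative textbook material, not code from the tutorial).
import random

EPS = 0.005  # standard small bias from the textbook formulation

def play_a(capital):
    # Game A: a slightly unfair coin; losing on its own.
    return capital + 1 if random.random() < 0.5 - EPS else capital - 1

def play_b(capital):
    # Game B: odds depend on the current capital modulo 3; also losing on its own.
    p = (0.10 - EPS) if capital % 3 == 0 else (0.75 - EPS)
    return capital + 1 if random.random() < p else capital - 1

def play_mixed(capital):
    # Randomly alternate between the two losing games.
    game = play_a if random.random() < 0.5 else play_b
    return game(capital)

def simulate(game, steps=100_000, seed=1):
    random.seed(seed)
    capital = 0
    for _ in range(steps):
        capital = game(capital)
    return capital

print("A alone:    ", simulate(play_a))      # negative drift
print("B alone:    ", simulate(play_b))      # negative drift
print("random mix: ", simulate(play_mixed))  # positive drift: the paradox
```

The connection to evaluation: behavior measured in one fixed regime can invert once regimes switch, which is why long-horizon, switching-environment tests can reveal properties that static benchmarks miss.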
Kang Hao Cheong 📧kanghao.cheong@ntu.edu.sg
Associate Professor, School of Physical and Mathematical Sciences, Nanyang Technological University, with a joint appointment in the College of Computing and Data Science (CCDS), Nanyang Technological University. Dr Cheong is listed among the World's Top 2% Scientists (Stanford University/Elsevier study), for both career-long and single-year impact, ranked in the top 0.5% of the Artificial Intelligence & Image Processing category. His research interests include AI in medicine/healthcare, complexity science, evolutionary computation, and network science. He has published in journals such as PNAS, PRL, Nature Communications, Advanced Materials, IEEE TEVC, IEEE TCYB, IEEE TSMC, IEEE TFS, and TNNLS. He currently serves on the editorial boards of Frontiers in Human Neuroscience, Journal of Computational Science, Games, and Scientific Reports, and has reviewed for more than 20 journals, including Nature Communications, Nature Machine Intelligence, IEEE TPAMI, IEEE TAC, IEEE TCYB, and the Physical Review (PR) journals.
Shiyu Hu 📧shiyu.hu@ntu.edu.sg
Research Fellow at the School of Physical and Mathematical Sciences, Nanyang Technological University. Her research focuses on computer vision, large language models, and multi-modal learning. She has published over 20 papers in top-tier journals and conferences, such as TPAMI, IJCV, and NeurIPS, and received the Best Paper Honorable Mention at the CVPR VDU Workshop. She has developed widely used open-source platforms, including VideoCube, SOTVerse, and BioDrone, which have gained recognition from users in 130+ countries and regions. Additionally, she has delivered tutorials at AI conferences such as ICIP, ICPR, and ACCV, and authored an English monograph on artificial intelligence. She also serves as a reviewer and program committee member for CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, and TIP.
Jie Zhao 📧jie.zhao@ntu.edu.sg
Research Fellow, School of Physical and Mathematical Sciences, Nanyang Technological University. His research interests include network science, computational intelligence, and evolutionary computation. He has published more than 20 papers in top-tier journals such as IEEE TEVC, IEEE TSMC, and IEEE TFS.
Yongbao Wu 📧yongbao.wu@ntu.edu.sg
Research Fellow, School of Physical and Mathematical Sciences, Nanyang Technological University. His current research interests include stability theory for stochastic differential equations, networked control systems, and attacks on networked systems. He has published over 30 research papers in top international journals and conferences, such as Automatica, IEEE TSMC, IEEE TCNS, and IEEE TFS.
Half-day. October 25-30, 2025, Bologna, Italy. (Detailed timetable and location will be announced soon.)
This tutorial is designed for PhD students, researchers, and industry professionals interested in AI evaluation. (1) PhD students will gain foundational knowledge beyond accuracy-based metrics, helping them understand key research directions in trustworthy AI assessment. (2) Experienced researchers will explore adaptive AI evaluation and dynamic performance measurement, essential for assessing AI’s long-term stability and generalization. (3) Industry professionals in autonomous driving, medical AI, and finance will learn how evaluation frameworks enhance AI reliability in real-world applications. Participants should have (1) basic knowledge of machine learning and deep learning, (2) familiarity with computer vision or complex decision-making systems, and (3) some probability background (helpful but not required).
Slides and related materials will be added here.