Associate Professor, Keio University
Title: How to Design Benchmarks for Augmented Society: Based on 20 Years of Trends in Human-Agent Interaction Research
Abstract: Human-Agent Interaction (HAI) is a methodology for augmenting human society by introducing social agents that behave with intent derived from humans. With recent rapid advances in generative AI, HAI is increasingly converging with AI agent research. Benchmarks in Human-Agent Interaction must evaluate not only algorithms and devices, but also the impact of individual agents' behaviors and their interpersonal relationships on human society. The speaker serves as General Chair of HAI 2025 and as an HAI Steering Committee Chair.
From this vantage point, this talk will trace the development of HAI research over the past two decades and explore potential benchmarks for current HAI studies. Additionally, the speaker will present research findings on how dual-agent relationships influence humans, based on his own ongoing work.
Bio: Dr. Hirotaka Osawa is an associate professor at Keio University and a visiting associate professor at the University of Tsukuba. His research fields include human-agent interaction, the development of anthropomorphic devices, the simulation of social agents using social games, and the study of humanity through science fiction literature. He focuses specifically on how human-like appearance and attitude enhance the interaction between users and machines, as well as on the role of social intelligence in improving our society.
Dr. Osawa earned his PhD in Engineering and his Master’s and Bachelor’s degrees in Computer Science, all from Keio University.
Head of Frontier AI Research, ServiceNow
Title: Agentic Full-Stack Benchmarking for Knowledge Work
Abstract: In less than a year, AI agents have evolved from a research curiosity into the foundation of some of the largest software platform updates in decades. These systems promise to automate substantial portions of knowledge work, and their progress has been rapid, with early 2025 reports by METR suggesting that the complexity of solvable tasks doubles roughly every seven months. In this talk, we take a closer empirical look at this claim by examining what it truly takes to benchmark agentic performance on long-running, open-ended knowledge work tasks. We review recent contributions from ServiceNow Research and others across domains such as browser use, multimodal understanding, data analytics, and deep research. We also discuss benchmarks that evaluate agentic safety and security, arguing that these dimensions cannot be meaningfully separated from primary task performance. Our analysis leads to a more nuanced picture of the field, highlighting both genuine advances and persistent challenges that frontier agents have yet to overcome.
Bio: Dr. Alexandre Drouin leads the Frontier AI Research group at ServiceNow Research and is an Adjunct Professor of Computer Science at Laval University and Mila. His work focuses on machine learning for decision-making in complex, dynamic environments, with an emphasis on causal inference, probabilistic time series forecasting, and, more recently, LLM-based agents for automating decisions in these contexts. His recent contributions include benchmarks and frameworks for developing agents with capabilities in browser automation, data analytics, and forecasting, as well as for assessing their security and robustness.
Project Lead, IBM T. J. Watson Research Center
Title: Small Language Models for Enterprise Agentic Workflows
Abstract: Large language models dominate current discussions on agentic AI, but their cost and complexity often limit enterprise adoption. Our research explores how small language models can be equipped with function calling, reasoning, and planning capabilities to perform effectively in enterprise workflows. Agentic AI enables systems to reason over complex states, call tools, and plan multi-step actions. I will present methods for aligning small models with these tasks, leveraging synthetic data for adaptation, and evaluating their reliability in real-world automation settings. The goal is to show how efficient models can deliver trustworthy, enterprise-grade agentic AI solutions.
While our design principles are broadly applicable, I will focus on IT automation, drawing on IBM’s ITBench, BFCL v4, and other agentic benchmarks. IT automation is critical because enterprises depend on reliable, scalable, and secure IT services, and maintaining this reliability requires continuous monitoring, rapid incident response, and efficient remediation.
Bio: Dr. Asim Munawar is a Project Lead at IBM’s T. J. Watson Research Center in New York, where he heads efforts to enhance reasoning, planning, and agentic workflows in enterprise-scale large language models. With over 15 years of experience in AI, more than a decade of it at IBM Research, he has held key leadership roles, including Manager and Program Director for Neuro-Symbolic AI.
Dr. Munawar earned his Ph.D. from Hokkaido University, Japan, and has authored over 80 peer-reviewed publications. He is an inventor on 20+ U.S. patents and a frequent keynote and invited speaker at top venues such as IJCAI, ICSE, and ACMSE. He also serves on advisory boards for the National Center of Artificial Intelligence in Pakistan and the Centaur AI Institute in the U.S.
His work focuses on building scalable, high-impact AI systems and fostering strong, diverse teams. He continues to advance the capabilities of AI for solving real-world enterprise challenges.
Professor, Keio University
Title: Multimodal AI for Sustainable Progress
Abstract: Recent advances in multimodal large language models (MLLMs) that integrate vision, language, and other modalities are transforming a wide range of societal applications. This talk will introduce the fundamentals of MLLMs and examine their capabilities and limitations across diverse tasks, followed by a discussion of benchmarks for their evaluation. Despite the substantial human and financial resources devoted to the development of MLLMs, realistic benchmarks that fully capture the complexity of multimodal understanding and generation remain scarce. I will present our recent work on MLLM applications, together with new benchmarks designed for their evaluation, and conclude by outlining future directions and challenges for advancing multimodal AI.
Bio: Dr. Komei Sugiura is a Professor at Keio University, Japan. He received his B.E. in Electrical and Electronic Engineering in 2002, and his M.S. and Ph.D. in Informatics in 2004 and 2007, respectively, all from Kyoto University. From 2006 to 2008, he was a Research Fellow of the Japan Society for the Promotion of Science, and from 2006 to 2009, he was also with ATR Spoken Language Communication Research Laboratories. From 2008 to 2020, he was with the National Institute of Information and Communications Technology, Japan, before joining Keio University in 2020. His research interests include multimodal language understanding, service robots, machine learning, spoken dialogue systems, cloud robotics, imitation learning, and recommender systems.
Senior Project Director, Fujitsu Limited
Title: Agentic AI Benchmarks in the Enterprise
Abstract: In recent years, AI agents centered on generative AI have been rapidly gaining traction. These agents go beyond mere content generation: by automating task planning, external tool use, data collection, and result integration, they are extending AI into areas of decision-making and creative work that were once the domain of humans. As a result, a more flexible and autonomous form of human–AI collaboration is emerging across diverse business and social contexts.
At the same time, despite the growing scope of AI agent applications, standards for evaluating their validity and reliability are not yet well established. In enterprise environments in particular, AI agents make decisions and take actions while collaborating with multiple business systems and human users, which makes evaluation especially challenging. There is a growing need for quantitative assessment from multiple perspectives, including agent collaboration, task success rate, and an agent’s ability to improve through self-learning and memory.
This presentation introduces the current status and challenges of applying AI agents in enterprise settings, using the Fujitsu Kozuchi AI Agent as an example. In particular, it discusses the practical difficulties of developing “domain-specific agents” revealed by real-world applications such as fieldwork support and data analysis, and outlines benchmark approaches that enable their quantitative evaluation.
Bio: Dr. Hiro Kobashi is a Senior Project Director of the Artificial Intelligence Laboratory at Fujitsu Research, where he leads teams of researchers in the United States, India, and Japan working on AI agent technologies to realize sustainable and efficient AI agents. He joined Fujitsu in 2003 and has worked at Fujitsu research organizations in both Japan and the United Kingdom. His research interests include artificial intelligence, machine learning, and distributed systems.
Assistant Professor, University of Illinois Urbana-Champaign (UIUC)
Title: Establishing Best Practices for Building Rigorous Agentic Benchmarks
Abstract: Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
Bio: Dr. Daniel Kang is an assistant professor at UIUC in the computer science department and in ECE by courtesy, and a research scientist at Bridgewater AIA Labs. His work focuses on building rigorous AI agent benchmarks, with an emphasis on secure and robust deployment. As part of this work, he has created award-winning benchmarks (CVE-Bench) and award-winning standards for creating high-quality AI agent benchmarks (the Agentic Benchmark Checklist). Daniel's research has been supported by Google, the Open Philanthropy Project, Emergent Ventures, Capital One, and others.
Applied Scientist, Amazon Research
Title: TBA
Abstract:
Bio: