Title: Large Language Model Hallucination Benchmarking Research with MSRA
Abstract: Large language models (LLMs) have achieved unprecedented performance in various applications, yet evaluating them remains challenging. Existing benchmarks are either manually constructed or automatically generated, but lack the ability to evaluate the thought process of LLMs with arbitrary complexity. We contend that utilizing existing relational databases based on the entity-relationship (ER) model is a promising approach for constructing benchmarks, as they contain structured knowledge that can be used to question LLMs. Unlike knowledge bases, which are also used to evaluate LLMs, relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers: (1) functional dependencies can be used to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values; and (2) foreign key constraints can be used to join relations and construct multi-hop questions, which can be arbitrarily long and used to debug intermediate answers. We thus propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark. ERBench supports continuous evaluation as databases change, multimodal questions, and various prompt engineering techniques. In our experiments, we construct LLM benchmarks using databases from multiple domains and make an extensive comparison of contemporary LLMs. We show how ERBench can properly evaluate any LLM by not only checking for answer correctness, but also effectively verifying the rationales by looking for the right keywords. This work was published at NeurIPS 2024 (Spotlight) in collaboration with MSRA, and I will also introduce an ongoing extension of ERBench that evaluates factual, time-sensitive question answering in LLMs using temporal databases, along with other collaboration results.
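The sketch below is a minimal illustration of the two integrity constraints described above, using a hypothetical movie/director schema (not taken from the paper): the foreign key joins two relations into a multi-hop question, while the functional dependency (title, year) -> director pins down the keyword that a correct rationale must mention, so both the intermediate hop and the final answer can be verified.

```python
# Minimal sketch of the ERBench idea (hypothetical schema and data, not from the paper):
# a foreign key join turns two single-hop facts into one multi-hop question, and a
# functional dependency (title, year) -> director gives the keyword that the LLM's
# rationale must contain for its intermediate answer to be verifiable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (name TEXT PRIMARY KEY, birth_year INTEGER);
CREATE TABLE movie (
    title TEXT, year INTEGER, director TEXT REFERENCES director(name),
    PRIMARY KEY (title, year)            -- FD: (title, year) -> director
);
INSERT INTO director VALUES ('Christopher Nolan', 1970);
INSERT INTO movie VALUES ('Inception', 2010, 'Christopher Nolan');
""")

# Join movie and director through the foreign key to build a two-hop question.
title, year, director, birth_year = conn.execute("""
    SELECT m.title, m.year, d.name, d.birth_year
    FROM movie m JOIN director d ON m.director = d.name
""").fetchone()

question = (f"Who directed the movie {title} ({year}), "
            f"and in which year was that director born?")
rationale_keyword = director   # intermediate answer, pinned down by the FD
final_answer = birth_year      # final answer, reached via the foreign key

def verify(llm_response: str) -> bool:
    """Check both the intermediate hop (keyword) and the final answer."""
    return rationale_keyword in llm_response and str(final_answer) in llm_response

print(question)
print(verify("Inception (2010) was directed by Christopher Nolan, born in 1970."))  # True
```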
Bio: Steven Euijong Whang is an associate professor with tenure at KAIST EE and AI and leads the Data Intelligence Lab. His research interests include Responsible AI and Data-centric AI. He is an Associate Editor of IEEE TKDE (2023-2025) and VLDB 2025, and an Area Chair of ICLR 2025. Previously he was a Research Scientist at Google Research and co-developed the data infrastructure of the TensorFlow Extended (TFX) machine learning platform. Steven earned his PhD in Computer Science in 2012 from Stanford University. He is a Y-KAST (Young Korean Academy of Science and Technology) member, was a Kwon Oh-Hyun Endowed Chair Professor (2020-2023), and received a Google AI Focused Research Award (2018, the first in Asia). Homepage: https://stevenwhang.com
Title: Diagnose, Debug, and Enhance: A Three-Stage Framework for AI Compositional Reasoning
Abstract: Large language models still lag behind humans on abstract reasoning tasks—especially in composing multiple operations—despite impressive pattern-matching abilities. In this talk, I introduce a three-stage framework to tackle this bottleneck. First, we diagnose core weaknesses via a process-centric evaluation of ARC tasks, measuring logical coherence, compositionality, and productivity. Second, we debug the compositionality failure by converting ARC problems into multiple-choice questions (MC-LARC), revealing precise failure modes in the “Understand” and “Apply” stages. Finally, we enhance compositional reasoning with GIFARC, which injects human-intuitive analogies extracted from GIFs to guide models through “analogy → grid composition” steps. Together, these steps—diagnose → debug → enhance—form a cohesive pipeline toward more human-like, explainable, and aligned AI reasoning.
Short Bio: Sundong Kim has been an Assistant Professor at the AI Graduate School, Gwangju Institute of Science and Technology (GIST) since 2022, where he leads a lab focused on exploring diverse data science methodologies in pursuit of general intelligence. He earned his Ph.D. in Knowledge Service Engineering from KAIST, during which he was a Research Intern at Microsoft Research Asia, and subsequently served as a Young Scientist Fellow at the Institute for Basic Science before joining GIST.
Title: Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Abstract: The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this talk, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely internalize and act according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
Bio: JinYeong Bak is an associate professor in the College of Computing at Sungkyunkwan University. His research interests include analyzing human conversational behaviors and building machine learning models from the insights of these analyses. He worked at Microsoft Research Asia as a research intern and at United Nations Global Pulse Lab Jakarta as a junior data scientist. He holds a Ph.D. and an M.S. from KAIST and a B.S. from Sungkyunkwan University. His research has been published in ACL, EMNLP, NAACL, ICML, CHI, and WWW. His personal homepage: https://nosyu.kr and lab homepage: https://hli.skku.edu
Title: Value Compass Benchmarks: Towards Comprehensive, Generative and Self-Evolving Evaluation of LLMs' Value Alignment
Abstract: As LLM-based generative models become increasingly integrated into human life, it is essential to assess their potential risks and societal impacts. Beyond risk-specific benchmarks, evaluating the value orientations reflected in LLMs offers a holistic lens for diagnosing potential misalignments and understanding how they align with the preferences of diverse user groups. However, value evaluation faces validity and informativeness challenges: how to ensure that assessments accurately capture an LLM’s underlying values and yield insightful and informative results. To address these challenges, we propose Generative Self-Evolving Evaluation, which leverages LLMs’ generative capacity and psychometric theory to dynamically and adaptively probe their value boundaries. Our method automatically generates novel, value-evoking items to avoid data contamination and ceiling effects, enabling a more faithful investigation of models' values. Building on this framework, we present the Value Compass Benchmarks, an online leaderboard offering a comprehensive analysis of the value orientations of 33+ popular LLMs.
Bio: Xiaoyuan Yi is a Senior Researcher at Microsoft Research Asia. He obtained his bachelor’s and doctorate degrees in computer science from Tsinghua University and is mainly engaged in Natural Language Generation (NLG) and Societal AI research. He led the development of one of the most famous AI poetry generation systems in China, which has millions of users from 100+ countries. He has published 30+ papers at top-tier AI venues and received honors such as the Tsinghua University Supreme Scholarship, Xinhua Net's 10 Most Influential People on the Internet, the Best Paper and Best Demo Awards of the Chinese Conference on Computational Linguistics, the Rising Star Award of the IJCAI Young Elite Symposium, and Rising Stars in Social Computing by the Chinese Association for Artificial Intelligence, among others.
Title: From Universal Value Alignment to Customized Alignment for Large Language Models
Abstract: As Large Language Models (LLMs) become more deeply integrated into human life, aligning them with universal values such as helpfulness, harmlessness, and honesty becomes insufficient to satisfy diverse users across cultures and communities. It is therefore crucial to customize the alignment of LLMs to improve user experience and mitigate social conflicts. Despite considerable advancements in recent years, there has been no clear discussion of what goals customized LLM alignment should pursue and what key challenges lie in this field. To bridge this gap, we conducted a comprehensive survey to characterize this task and shed light on its inherent challenges. Along this direction, we first delve into cultural alignment and address its data challenges. Existing approaches to cultural alignment face two key challenges. (1) Representativeness: they fail to fully capture the target culture's core characteristics and contain redundancy, wasting computation; (2) Distinctiveness: they struggle to distinguish the unique nuances of a given culture from patterns shared with other related cultures, hindering precise cultural modeling. To handle these challenges, we introduce a novel cultural data construction framework. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, improving both performance and efficiency.
Bio: Jing Yao is a researcher in the Social Computing Group at Microsoft Research Asia. She received her M.S. degree in Computer Science from Renmin University of China in 2022 and her B.S. degree in Computer Science from the same university in 2019, and joined MSRA in July 2022. Her research interests include responsible AI, large language model alignment, trustworthy recommendation, and information retrieval. She has published academic papers at top-tier international conferences such as NeurIPS, ACL, SIGIR, WWW, NAACL, and CIKM, and serves as a program committee member for conferences such as NeurIPS, ICLR, ACL, and SIGIR.