Shared task evaluation in the era of LLMs
Jin-Dong Kim, Database Center for Life Science (DBCLS), Japan
Abstract: The emergence of large language models (LLMs) and generative AI is reshaping the landscape of industry, academia, and everyday life. Shared task evaluation is no exception to this transformation. Since 2024, the LLM-as-judge paradigm has garnered increasing attention and become a focal point of research. An expanding body of work is now investigating the potential of LLMs to serve as evaluators, providing new avenues for assessing system outputs beyond conventional human-based methods.
This talk will discuss key considerations in employing LLMs for evaluation—such as reliability, bias, and calibration—and illustrate how tasks such as Hidden-RAD can benefit from this emerging paradigm. The presentation will also reflect on the implications of LLM-assisted evaluation for the future of shared tasks and benchmark development, highlighting opportunities for more robust, scalable, and transparent evaluation frameworks.
Bio: Project Associate Professor at the Database Center for Life Science (DBCLS), Japan. Expert in biomedical text mining and scientific information extraction with extensive experience organizing shared tasks for biomedical NLP. His work on structured knowledge extraction complements the workshop's focus on causal reasoning in specialized domains. https://data.dbcls.jp/~jdkim/
Building a Domain Expert Financial LLM: A Case Study in Korean Knowledge Acquisition and Validation
Young-Gyun Hahm, Teddysum Inc., Republic of Korea
Abstract: The development of successful domain-specific Large Language Models (LLMs) critically depends on leveraging proprietary and specialized knowledge to enhance the performance of general-purpose models. This is particularly crucial for non-English models, where the pre-training process must incorporate not only linguistic proficiency but also domain-specific knowledge tied to that language, such as local legal frameworks and regulatory documents.
This invited talk presents bllossom, a specialized Korean LLM, and details the comprehensive lifecycle employed to instill and evaluate financial domain expertise within the model. We will first introduce bllossom and underscore the unique challenges of acquiring, processing, and integrating high-quality financial knowledge specific to the Korean context.
Bio: AI researcher at Teddysum Inc., focusing on developing complex reasoning systems with LLMs. Led the winning team for NTCIR-18 Hidden-RAD Subtask 2, implementing innovative approaches combining Chain-of-Thought, RAG, and Tree-of-Thought. Experienced in organizing technical workshops and challenge tasks. https://www.teddysum.ai/
Generating and Self-Verifying Radiology Findings and Causal Reasoning
Mercy Ranjit, Microsoft Research India, India
Abstract: Radiology reports require both accurate findings and structured reasoning that links observations to diagnostic possibilities. We define causal exploration as expanding findings into structured reasoning that incorporates radiology knowledge, differential diagnosis, and clinical context to support better clinical diagnosis. Rad-Phi4-Vision-CXR is a model that unifies findings generation and causal exploration capabilities. A self-verification workflow critiques and revises both descriptive and reasoning sections: findings are evaluated against RADPEER guidelines for error detection and audit alignment, while reasoning is refined through iterative verification to ensure coherence and clinical soundness. Together, these advances move AI beyond surface-level text generation toward radiology reports that are clinically trustworthy, auditable, and educational.
Bio: Mercy Ranjit is a Principal Research ML Engineer at Microsoft Research India, where she leads work on multimodal small language models for radiology. Her research combines imaging, clinical context, and structured reasoning to develop scalable solutions that improve diagnostic accuracy and support real-world clinical deployment.
Retrieval–Reasoning Enhanced Generation for Radiology Reports: Experience from the NTCIR-18 Hidden-RAD Task
Seung-Hoon Na, UNIST, Republic of Korea
Abstract: This talk presents the experience of our team in the NTCIR-18 Hidden-RAD Task, which focused on generating causality-based diagnostic inferences from radiology reports. In Subtask 1, we developed a cost-efficient API-driven inference pipeline that integrates few-shot in-context learning, retrieval-enhanced prompting, and strict candidate selection with an evaluation checklist. By dynamically enriching prompts with retrieved similar cases, this approach achieved 1st place in the official evaluation. In Subtask 2, we introduced PRISMA-Guided Causal Explanation, a structured prompt-based reasoning method that improved interpretability and secured 2nd place. We also explored fine-tuning with domain-specific prompting, which, while not included in the final ranking, demonstrated promise for improving adaptability and interpretability.
Building on these results, the talk will further explore advances toward reasoning-enhanced methods and test-time adaptation, including dynamic retrieval strategies, hybrid symbolic–neural reasoning frameworks, and lightweight inference-time tuning. These approaches aim to strengthen explainable AI in radiology, bridging the gap between automated diagnostic inference and human expert decision-making.
Bio: Associate professor at the Artificial Intelligence Graduate School and the Department of Computer Science & Engineering at UNIST since 2025, where he leads the Natural Language Processing Lab. Before joining UNIST, he was a tenured full professor at the Department of Computer Science & Artificial Intelligence at Jeonbuk National University, where he had served since 2015. Prior to that, he was a senior researcher at ETRI, after working as a research fellow at the School of Computing, National University of Singapore. He received his Ph.D. in Computer Science from POSTECH in 2008 under the supervision of Prof. Jong-Hyeok Lee. Before that, he earned his M.S. in Computer Science from POSTECH in 2003 and his B.S. in Information and Computer Science from Ajou University in 2001. Currently, he serves as a Member-at-Large (MAL) of the Asian Federation of Natural Language Processing (AFNLP) and a standing reviewer for Computational Linguistics. He also served as a publication co-chair for COLING 2022 and the chair of the Special Interest Group on Human and Cognitive Language Technology in the Republic of Korea. His research interests include natural language processing, information retrieval, and machine learning. https://nlp.unist.ac.kr/faculty.html