[#336] 2026-02-24 [CVPR 2026] MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models (by Sangyun Chung) is accepted to CVPR 2026.
Title: MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Sangyun Chung, Se Yeon Kim, Youngchae Chee, and Yong Man Ro
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated outputs. This exposes a fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model’s inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD
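For readers curious how self-assessed modality probabilities could weight contrastive decoding branches in practice, below is a minimal Python sketch. It is our illustrative reading, not the authors' implementation (see the repository above for that): `mad_decode_step`, the branch-combination rule, and the `alpha` parameter are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mad_decode_step(logits_full, ablated_logits, modality_probs, alpha=1.0):
    """One greedy decoding step of modality-weighted contrastive decoding
    (illustrative sketch; not the paper's implementation).

    logits_full    : next-token logits with all modalities present, shape (V,)
    ablated_logits : dict of modality name -> logits with that modality masked
    modality_probs : dict of modality name -> self-assessed relevance in [0, 1]
    alpha          : overall contrastive strength
    """
    combined = logits_full.astype(float).copy()
    for m, logits_wo_m in ablated_logits.items():
        # A modality the model judges relevant gets a stronger contrastive
        # branch, pushing the output away from what it would say without it.
        combined += alpha * modality_probs.get(m, 0.0) * (logits_full - logits_wo_m)
    return int(np.argmax(softmax(combined)))

# Toy usage: the self-assessment says audio matters (0.9), video barely (0.1).
full = np.array([1.0, 2.0, 0.5])
ablated = {"audio": np.array([2.0, 1.0, 0.5]),
           "video": np.array([1.0, 2.1, 0.5])}
print(mad_decode_step(full, ablated, {"audio": 0.9, "video": 0.1}))  # -> 1
```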
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2026
[#335] 2026-02-24 [CVPR 2026] Recursive Think-Answer Process for LLMs and VLMs (by Byung-Kwan Lee & Youngchae Chee) is accepted to CVPR 2026 Findings.
Title: Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Think–Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like “Oops!”, they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think–Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of “Oops”-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods that refine the reasoning processes of future AI.
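As a rough illustration of the recursive loop described in the abstract, here is a minimal Python sketch. The `Turn` record, the stopping rule, and the prompt-refinement wording are our assumptions; the paper's confidence generator and its two rewards operate during training and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    answer: str
    confidence: float  # stand-in for the confidence generator's output, in [0, 1]

def recursive_think_answer(generate: Callable[[str], Turn],
                           prompt: str,
                           threshold: float = 0.9,
                           max_rounds: int = 4) -> Turn:
    """Run think->answer rounds, feeding each answer and its confidence back
    into the context, until confidence clears the threshold or the round
    budget runs out; return the highest-confidence answer seen."""
    context, best = prompt, None
    for _ in range(max_rounds):
        turn = generate(context)
        if best is None or turn.confidence > best.confidence:
            best = turn
        if turn.confidence >= threshold:
            break
        context = (f"{prompt}\nPrevious answer: {turn.answer} "
                   f"(confidence {turn.confidence:.2f}). Think again and improve.")
    return best

# Toy usage with a stub "model" whose confidence grows each round.
state = {"round": 0}
def stub(context: str) -> Turn:
    state["round"] += 1
    return Turn(answer=f"draft {state['round']}", confidence=0.4 + 0.25 * state["round"])

print(recursive_think_answer(stub, "What is 2 + 2?"))  # stops once confidence >= 0.9
```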
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2026
[#334] 2026-02-24 [CVPR 2026] ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding (by Hosu Lee) is accepted to CVPR 2026 Findings.
Title: ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding
Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro
Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to solve this with static heuristics or external retrieval modules that feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS learns a frame selection policy using reward signals derived from reference models, capturing their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.
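The policy-gradient idea can be made concrete with a toy REINFORCE loop over frame scores. This sketch deliberately simplifies the paper's autoregressive, query-conditional selector to a static score vector and samples with replacement; `reward_fn` (standing in for the reference-model reward) and all hyperparameters are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sample_frames(scores, k):
    """Sample k frame indices (with replacement, for simplicity) from a
    softmax policy over frame scores, returning the exact gradient of the
    sample's log-probability with respect to the scores."""
    probs = softmax(scores)
    grad_logp = np.zeros_like(scores)
    chosen = []
    for _ in range(k):
        i = rng.choice(len(scores), p=probs)
        chosen.append(i)
        grad_logp -= probs
        grad_logp[i] += 1.0  # d/ds log softmax(s)_i = onehot(i) - probs
    return chosen, grad_logp

def reinforce_step(scores, k, reward_fn, lr=0.5, n_samples=16):
    """One REINFORCE update with a mean-reward baseline: frame sets the
    reference model rewards above average get their log-probability raised."""
    grads, rewards = [], []
    for _ in range(n_samples):
        chosen, g = sample_frames(scores, k)
        grads.append(g)
        rewards.append(reward_fn(chosen))
    adv = np.array(rewards) - np.mean(rewards)
    update = sum(a * g for a, g in zip(adv, grads)) / n_samples
    return scores + lr * update

# Toy usage: the "reference model" prefers frames 2 and 5 of an 8-frame clip.
scores = np.zeros(8)
reward = lambda frames: sum(f in (2, 5) for f in frames)
for _ in range(200):
    scores = reinforce_step(scores, k=2, reward_fn=reward)
print(np.sort(np.argsort(scores)[-2:]))  # policy mass concentrates on [2 5]
```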
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2026
[#333] 2026-02-20 [Appointed as Assistant Professor] Dr. Se Jin Park (Advisor: Prof. Yong Man Ro) joins Kyung Hee University as Assistant Professor in the Department of Electronic Engineering.
Se Jin Park, a doctoral researcher from Professor Yong Man Ro’s laboratory, received her Ph.D. in February 2026 and has been appointed as an Assistant Professor in the Department of Electronic Engineering at Kyung Hee University as of March 2026. Throughout her doctoral studies, Park conducted research on multimodal artificial intelligence that integrates speech, vision, and language, with the goal of enabling natural and seamless interaction between humans and AI.
Park has been developing methods for visual–acoustic representation learning, modeling long- and short-term conversational context, and leveraging both linguistic and nonverbal cues from human interaction for dialogue understanding and generation. Her research achievements have been recognized internationally. She has presented a total of 13 papers at top-tier conferences such as ICML, ACL, CVPR, AAAI, and ICASSP, and her work has been selected for several prestigious distinctions, including the ACL Outstanding Paper Award, ICML Oral, CVPR Highlight, ACL Oral, and AAAI Oral. Through these accomplishments, Park has established herself as a competitive researcher in the fields of multimodal AI and conversational intelligence.
Park has expressed her intention to continue pursuing research on conversational intelligence that enables AI systems to collaborate and communicate effectively with real users in complex interaction environments that combine speech, vision, and language. Our school sincerely congratulates her on this new beginning and looks forward to her future contributions in education, research, and industry collaboration at Kyung Hee University.
[#332] 2026-02-04 [Appointed as Assistant Professor] Dr. Hong Joo Lee (Advisor: Prof. Yong Man Ro) appointed as Assistant Professor at Seoul National University of Science and Technology.
Dr. Hong Joo Lee, an alumnus of the School of Electrical Engineering at KAIST (Advisor: Prof. Yong Man Ro), has been appointed as an Assistant Professor in the Department of Applied Artificial Intelligence at Seoul National University of Science and Technology, effective March 1, 2026.
Dr. Lee earned his Ph.D. with a dissertation titled "Investigating Adversarial Robustness via Booster Signal." During his doctoral studies, he participated in the Center for Applied Research in Artificial Intelligence (CARAI) for National Defense Research. His research has been widely recognized through numerous publications in top-tier conferences and journals, including CVPR, IEEE TIP, and IEEE TNNLS.
After receiving his doctorate in 2023, Dr. Lee served as a postdoctoral researcher at the Technical University of Munich (TUM) in Germany. His postdoctoral work focused on the reliability of AI models in the medical field, leading to further high-impact publications in ECCV, MICCAI, and AAAI.
In his new role as a professor, Dr. Lee plans to deepen his research on Reliable Intelligence Systems, focusing on the vulnerability, safety, and fairness of AI models.
[#331] 2026-01-18 [ICASSP 2026] Robust Grounding with MLLMs against Occlusion and Small Objects via Language-Guided Semantic Cues (by Beomchan Park & Seongho Kim) is accepted to ICASSP 2026.
Title: Robust Grounding with MLLMs against Occlusion and Small Objects via Language-Guided Semantic Cues
Beomchan Park*, Seongho Kim*, Hyunjun Kim, Sungjune Park, Yong Man Ro (*equal contribution)
While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
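To give a rough intuition for how language-guided cues might be extracted and reintegrated, here is a toy Python sketch using plain cross-attention. The residual fusion rule, the `gamma` weight, and the absence of learned projections are simplifying assumptions on our part, not the paper's SCE module.

```python
import numpy as np

def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention (single head, projections omitted)."""
    att = queries @ keys.T / np.sqrt(queries.shape[-1])
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)
    return att @ values

def language_guided_cues(visual_tokens, text_embeddings, gamma=0.5):
    """Toy LGSC-style refinement: visual object tokens query the text
    embeddings for linguistic semantic priors, and the retrieved cues are
    added back residually to restore degraded object semantics."""
    cues = cross_attend(visual_tokens, text_embeddings, text_embeddings)
    return visual_tokens + gamma * cues

# Toy usage: 4 object tokens and 3 text tokens in a shared 16-dim space.
rng = np.random.default_rng(0)
refined = language_guided_cues(rng.normal(size=(4, 16)), rng.normal(size=(3, 16)))
print(refined.shape)  # (4, 16)
```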
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ICASSP 2026