Poster Session 2
(12/12, Fri, 16:45~18:00)
[Part 2] 12/12(Fri.) 16:45~18:00
[2-1] Task Vector Quantization for Memory-Efficient Model Merging (ICCV 2025)
이승환(석사과정)
Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low-precision quantization (≤ 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints. Our code is available at https://aim-skku.github.io/TVQ/.
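The core idea, quantizing the narrow-range task vector rather than the full fine-tuned weights, can be sketched as follows. The symmetric uniform quantizer and function names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def quantize_task_vector(pretrained, finetuned, bits=4):
    """Symmetric uniform quantization of the task vector (finetuned - pretrained).

    Hypothetical sketch: because task vectors occupy a narrow weight range,
    a single per-tensor scale covers the whole tensor even at <= 4 bits.
    """
    tau = finetuned - pretrained           # task vector
    qmax = 2 ** (bits - 1) - 1             # e.g. 7 for 4-bit signed
    scale = np.abs(tau).max() / qmax
    q = np.clip(np.round(tau / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                        # store low-bit q plus one fp scale

def dequantize(q, scale, pretrained):
    return pretrained + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000).astype(np.float32)                    # pre-trained
w1 = w0 + rng.normal(scale=0.01, size=1000).astype(np.float32)   # fine-tuned (narrow delta)
q, s = quantize_task_vector(w0, w1, bits=4)
w1_hat = dequantize(q, s, w0)
```

Only the int8-packed `q` and one scalar per tensor need to be stored per task; the pre-trained weights are shared across all tasks.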
[2-2] Harnessing Influence Function in Explaining Graph Neural Networks (KDD 2025)
김찬용(석사과정)
Explaining graphs and their target Graph Neural Networks (GNNs) has gained attention with the growing use of GNNs. Most existing explainable AI (XAI) methods for GNNs focus on extracting an explanation subgraph and assume the target GNN is supervised with accessible class probabilities. However, the growing prevalence of GNN models in unsupervised settings underscores the necessity for task-irrelevant explanations. Moreover, most existing studies scarcely explore whether identifying edges absent from the original graph can improve explanation quality. To this end, we propose HINT-G (Harnessing INfluence function for Task-irrelevant explanation on Graph neural networks), a method that uses influence functions to explain models across diverse learning paradigms and considers edges beyond the given graph. The influence of an edge can be determined directly or by aggregating the influence scores of its constituent nodes, while the influence of a non-existent edge can also be determined. Furthermore, this method is task-irrelevant, since the influence score can be obtained whenever the loss function of the target model is differentiable. Experimental results on several datasets consistently demonstrate that HINT-G effectively explains graphs through the influence function framework.
[2-3] Bridging the Semantic Granularity Gap Between Text and Frame Representations for Partially Relevant Video Retrieval (AAAI 2025)
전우진(박사과정)
Partially Relevant Video Retrieval (PRVR) addresses the challenges of text-to-video retrieval in real-world scenarios where untrimmed videos are prevalent. Traditional PRVR methods encode videos at two feature scales: (1) frame-level to capture fine details, and (2) clip-level to recognize broader content. However, these approaches align both scales with a single sentence representation, leading to suboptimal performance. In particular, we point out the level mismatch in aligning frame-level video features with a sentence representation, as the entire meaning of a sentence contains broader and more diverse content than what frame-level features can encode. This misalignment causes frame-level features to capture broader contexts and overlook local fine details. To tackle this issue, we propose a framework that represents a sentence as a set of multiple components, where each component aligns with frame-level semantics. Specifically, we introduce Semantic-Decomposed Matching (SDM) to adjust the granularity of the text description to match them with frame-level video features. In addition to the matching process, we develop the Adaptive Local Aggregator (ALA) to enhance video encoding in capturing finer local details, ensuring precise text-video alignment at the frame level. ALA adaptively integrates multi-scale local details within short temporal spans obtained by enforcing a strict temporal aggregation range. Finally, we reinforce detailed encoding at the frame level with newly designed objectives for both modalities. Extensive experiments integrating our framework with existing clip branches demonstrate its effectiveness and applicability, highlighting significant improvements in PRVR performance.
[2-4] BOVIS: Bias-Mitigated Object-Enhanced Visual Emotion Analysis (CIKM 2025)
이유빈(석박통합과정)
Visual emotion analysis is a promising field that aims to predict emotional responses elicited by visual stimuli. While recent advances in deep learning have significantly improved emotion detection capabilities, existing methods often fall short because of their exclusive focus on either holistic visual features or semantic content, thereby neglecting their interplay. To address this limitation, we introduce BOVIS, a Bias-Mitigated Object-Enhanced Visual Emotion Analysis framework. To capture the subtle relationships between visual and semantic features and enrich the understanding of emotional contexts, BOVIS leverages pre-trained models to extract comprehensive image features, integrate object-level semantics, and enhance contextual information. Moreover, BOVIS incorporates a bias mitigation strategy that involves an adjusted Mean Absolute Error loss function alongside an Inverse Probability Weighting method to address dataset imbalances and enhance fairness in emotion prediction. Comprehensive evaluations across various benchmark datasets demonstrate the effectiveness of the BOVIS framework in enhancing visual emotion analysis. The results reveal that the synergy between object-specific features and holistic visual representations improves the accuracy and interpretability of emotion analysis, while optimizing bias mitigation enhances fairness and increases reliability.
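The bias mitigation strategy pairs an adjusted MAE loss with Inverse Probability Weighting. A minimal sketch of that combination is below; the normalization choice and function names are assumptions for illustration, not BOVIS's exact recipe:

```python
import numpy as np

def inverse_probability_weights(labels, n_classes):
    """Weight each sample by the inverse of its class frequency,
    so under-represented emotion classes contribute more to the loss."""
    counts = np.bincount(labels, minlength=n_classes).astype(np.float64)
    probs = counts / counts.sum()
    w = 1.0 / probs[labels]
    return w / w.mean()          # normalize so the average weight is 1

def weighted_mae(pred, target, weights):
    """Mean Absolute Error with per-sample inverse-probability weights."""
    return float(np.mean(weights * np.abs(pred - target)))

labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])   # imbalanced emotion labels
w = inverse_probability_weights(labels, 3)
```

Samples from the majority class (label 0) receive weights below 1, while minority-class samples receive weights above 1, counteracting the dataset imbalance in the loss.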
[2-5] Prediction of kidney function in deceased donor kidney transplant recipients (Journal of Nephrology)
Xiaohong Yu(박사과정)
This study addresses the growing demand for kidney transplantation amid a shortage of living donors by focusing on deceased donor kidneys, which are associated with shorter graft survival and higher risk of early allograft failure. We aim to non-invasively predict third-year post-transplant renal function, measured as estimated glomerular filtration rate (eGFR), using only pre-transplant donor and recipient variables. Several machine learning regression models were developed and evaluated using mean absolute error and standard deviation, with linear LASSO regression achieving the best performance (MAE = 17.017). In addition to providing low-error prediction of third-year eGFR after deceased donor kidney transplantation, the model enables identification of key donor and recipient risk factors that influence long-term graft function, offering potential support for pre-transplant risk stratification and clinical decision-making.
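The best-performing model is a linear LASSO regression evaluated by MAE. As a toy illustration of that setup (a plain coordinate-descent LASSO on synthetic data, not the study's clinical pipeline or features):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_cd(X, y, lam=0.1, n_iter=200):
    """Coordinate-descent LASSO: the L1 penalty drives uninformative
    coefficients to exactly zero, exposing the key risk factors."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ beta + X[:, j] * beta[j]        # partial residual
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta

def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stand-in donor/recipient variables
true_beta = np.array([3.0, 0.0, -2.0, 0.0, 0.0]) # sparse ground truth
y = X @ true_beta + rng.normal(scale=0.1, size=200)
beta = lasso_cd(X, y, lam=0.05)
```

The zeroed coefficients mimic how LASSO highlights which pre-transplant variables actually influence predicted eGFR.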
[2-6] Compile-Time QoS Scheme for Deep Learning Inferences (SC 2025)
홍성인(박사과정)
With the proliferation of deep learning technologies across various service domains, the sharing of accelerators such as GPUs, TPUs, and NPUs for inference processing has become increasingly common. These accelerators must efficiently handle multiple deep learning services operating concurrently. However, inference requests, characterized by sequences of short-duration kernels, create significant challenges for online schedulers attempting to maintain Quality of Service (QoS) guarantees. This paper presents QoSlicer, a novel compile-time QoS management framework that employs kernel slicing to relieve the burden on schedulers. By generating multiple pre-determined slicing plans, QoSlicer enables more efficient, lightweight QoS scheduling while ensuring target latency requirements are met. Our approach incorporates a heuristic search algorithm to identify optimal slicing plans and implements robust performance estimation models to validate these plans. Our experimental evaluation across 75 diverse workload combinations demonstrates that QoSlicer improves throughput by an average of 20.2% compared to state-of-the-art scheduling techniques.
[2-7] Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval (AAAI 2025)
조철호(석박통합과정)
Partially Relevant Video Retrieval (PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. Typical training processes of PRVR assume a one-to-one relationship where each text query is relevant to only one video. However, we point out the inherent ambiguity between text and video content based on their conceptual scope and propose a framework that incorporates this ambiguity into the model learning process. Specifically, we propose Ambiguity-Restrained representation Learning (ARL) to address ambiguous text-video pairs. Initially, ARL detects ambiguous pairs based on two criteria: uncertainty and similarity. Uncertainty represents whether instances include commonly shared context across the dataset, while similarity indicates pair-wise semantic overlap. Then, with the detected ambiguous pairs, our ARL hierarchically learns the semantic relationship via multi-positive contrastive learning and dual triplet margin loss. Additionally, we delve into fine-grained relationships within the video instances. Unlike typical training at the text-video level, where pairwise information is provided, we address the inherent ambiguity within frames of the same untrimmed video, which often contains multiple contexts. This allows us to further enhance learning at the text-frame level. Lastly, we propose cross-model ambiguity detection to mitigate the error propagation that occurs when a single model is employed to detect ambiguous pairs for its training. With all components combined, our proposed method demonstrates its effectiveness in PRVR.
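The similarity criterion for flagging ambiguous text-video pairs can be sketched as below. This is an illustrative stand-in using only cosine similarity; ARL additionally uses an uncertainty criterion and cross-model detection:

```python
import numpy as np

def detect_ambiguous_pairs(text_emb, video_emb, sim_thresh=0.8):
    """Flag unpaired text-video combinations whose cosine similarity
    exceeds a threshold as ambiguous (i.e., semantically overlapping
    even though they are not annotated as a pair)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T                       # pair-wise cosine similarity
    mask = sim > sim_thresh
    np.fill_diagonal(mask, False)       # the annotated pair itself is exempt
    return mask

text_emb = np.eye(4)
video_emb = np.eye(4).copy()
video_emb[1] = text_emb[0]              # video 1 overlaps text 0's content
mask = detect_ambiguous_pairs(text_emb, video_emb)
```

Flagged pairs would then be handled by the multi-positive contrastive and dual triplet margin losses rather than treated as hard negatives.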
[2-8] DIFF: Dual Side-Information Filtering and Fusion for Sequential Recommendation (SIGIR 2025)
김혜영(박사과정)
Side-information Integrated Sequential Recommendation (SISR) benefits from auxiliary item information to infer hidden user preferences, which is particularly effective for sparse interactions and cold-start scenarios. However, existing studies face two main challenges: (i) they fail to remove noisy signals in the item sequence, and (ii) they underutilize the potential of side-information integration. To tackle these issues, we propose a novel SISR model, Dual Side-Information Filtering and Fusion (DIFF), which employs frequency-based noise filtering and dual multi-sequence fusion. Specifically, we convert the item sequence to the frequency domain to filter out noisy short-term fluctuations in user interests. We then combine early and intermediate fusion to capture diverse relationships across item IDs and attributes. Thanks to our innovative filtering and fusion strategy, DIFF is more robust in learning subtle and complex item correlations in the sequence. DIFF outperforms state-of-the-art SISR models, achieving improvements of up to 14.1% and 12.5% in Recall@20 and NDCG@20 across four benchmark datasets.
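The frequency-based filtering step, transforming the sequence to the frequency domain and suppressing short-term fluctuations, can be sketched with a plain FFT low-pass filter. The cutoff policy below is an illustrative assumption, not DIFF's learned filter:

```python
import numpy as np

def lowpass_filter_sequence(seq_emb, keep_ratio=0.25):
    """FFT along the sequence axis, zero out high-frequency bins
    (short-term noise), then inverse-FFT back to the item sequence.
    seq_emb: (seq_len, dim) array of item embeddings."""
    freq = np.fft.rfft(seq_emb, axis=0)
    n_keep = max(1, int(freq.shape[0] * keep_ratio))
    freq[n_keep:] = 0.0                  # discard high-frequency components
    return np.fft.irfft(freq, n=seq_emb.shape[0], axis=0)

t = np.arange(64) / 64
slow = np.sin(2 * np.pi * t)             # long-term interest drift (bin 1)
noise = 0.3 * np.sin(2 * np.pi * 16 * t) # short-term fluctuation (bin 16)
seq = (slow + noise)[:, None]
filtered = lowpass_filter_sequence(seq, keep_ratio=0.2)
```

After filtering, the sequence recovers the slow interest drift while the high-frequency fluctuation is removed.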
[2-9] MerFT: A Framework for Social Conflict Meme Exploration via Multimodal Retrieval-Augmented Fine-tuning (WSDM 2026)
김기성(석박통합과정)
Social media widely circulates harmful and conflict-laden narratives, and internet memes are a key multimodal vehicle for such content. We present RoMQD, a multimodal dataset purpose-built for distractor-aware meme interpretation, and MerFT (Meme Exploration via Multimodal Retrieval-Augmented Fine-tuning), a training framework that integrates images, captions, and associated documents within RAG pipelines. MerFT couples citation-aware chain-of-thought with a document-aligned loss to ground answers in oracle evidence while discounting semantically similar but misleading distractors. We evaluate MerFT under multiple input configurations (Base, Caption, Both) while systematically varying distractor frequency. The model shows graceful degradation as noise increases, with Both (image+caption) inputs yielding the most reliable behavior. On RoMQD, MerFT improves over strong RAG baselines (e.g., +8.1 F1 with Qwen2.5-VL) and delivers larger gains on categories requiring nuanced cultural grounding, such as satire/irony and image–text integration. A clustering-based strategy for constructing challenging distractor pools further enhances robustness, and MerFT remains complementary to modern rerankers. These results demonstrate the feasibility of retrieval-robust multimodal reasoning for meme-based socio-cultural conflict analysis and provide practical guidance for building dependable content analysis systems for policy, communication, and socio-political monitoring. Our code is available at https://anonymous.4open.science/r/MerFT-E8C2 and we release our dataset at this URL.
[2-10] GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion (ACL 2025)
이선경(박사과정)
Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.
[2-11] BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks? (AAAI 2026)
김도영(석박통합과정)
Recent advances in model compression have highlighted the potential of low-bit precision techniques, with Binary Neural Networks (BNNs) attracting attention for their extreme efficiency. However, extreme quantization in BNNs limits representational capacity and destabilizes training, posing significant challenges for lightweight architectures with depth-wise convolutions. To address this, we propose a 1.58-bit convolution to enhance expressiveness and a pre-BN residual connection to stabilize optimization by improving the Hessian condition number. These innovations enable, to the best of our knowledge, the first successful binarization of depth-wise convolutions in BNNs. Our method achieves 33M OPs on ImageNet with MobileNet V1, establishing a new state-of-the-art in BNNs by outperforming prior methods with comparable OPs. Moreover, it consistently outperforms existing methods across various datasets, including CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet, and Oxford Flowers 102, with accuracy improvements of up to 9.3 percentage points.
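A 1.58-bit convolution quantizes weights to the ternary set {-1, 0, +1} (log2(3) ≈ 1.58 bits). A common ternary quantizer is sketched below; the threshold ratio and per-tensor scale are illustrative conventions, not necessarily BD-Net's formulation:

```python
import numpy as np

def ternary_quantize(w, delta_ratio=0.7):
    """Ternary {-1, 0, +1} weight quantization: small weights are zeroed,
    the rest keep their sign and share one scale on the nonzero entries."""
    delta = delta_ratio * np.mean(np.abs(w))           # zeroing threshold
    t = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    mask = t != 0
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return alpha * t, t                                # dequantized, ternary codes

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))                          # toy depth-wise kernel bank
w_q, t = ternary_quantize(w)
```

Compared with pure binarization, the explicit zero level gives depth-wise kernels more expressiveness at a marginal bit cost.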
[2-12] Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling (NeurIPS 2025)
안재준(석사과정)
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x.
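Weight bias correction aims to neutralize the mean/variance shift that quantization induces, so all bit-widths can share batch normalization statistics. A per-channel moment-matching sketch is below; this is one plausible realization, not necessarily the paper's exact correction:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantize-dequantize at a given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def bias_corrected_quantize(w, bits):
    """After quantizing, shift and rescale each output channel so its mean
    and std match the full-precision weights, removing quantization-induced
    bias so downstream BN statistics stay valid across bit-widths."""
    w_q = quantize(w, bits)
    mu, mu_q = w.mean(axis=1, keepdims=True), w_q.mean(axis=1, keepdims=True)
    sd, sd_q = w.std(axis=1, keepdims=True), w_q.std(axis=1, keepdims=True)
    return (w_q - mu_q) * (sd / (sd_q + 1e-12)) + mu

rng = np.random.default_rng(0)
w = rng.normal(loc=0.02, scale=0.1, size=(8, 256))   # 8 output channels
w2 = bias_corrected_quantize(w, bits=3)
```

Because each child bit-width reproduces the full-precision channel statistics, the activation distributions entering shared BN layers stay aligned without per-precision fine-tuning.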
[2-13] Diffusion Feature Field for Text-based 3D Editing with Gaussian Splatting (NeurIPS 2025)
고은서(박사과정)
Recent advances in text-based image editing have motivated the extension of these techniques into the 3D domain. However, existing methods typically apply 2D diffusion models independently to multiple viewpoints, resulting in significant artifacts, most notably the Janus problem, due to inconsistencies across edited views. To address this, we propose a novel approach termed DFFSplat, which integrates a 3D-consistent diffusion feature field into the editing pipeline. By rendering and injecting these 3D-consistent structural features into intermediate layers of a 2D diffusion model, our method effectively enforces geometric alignment and semantic coherence across views. However, averaging 3D features during the feature field learning process can lead to the loss of fine texture details. To overcome this, we introduce a dual-encoder architecture to disentangle view-independent structural information from view-dependent appearance details. By encoding only the disentangled structure into the 3D field and injecting it during 2D editing, our method produces semantically and multi-view coherent edited images while maintaining high text fidelity. Additionally, we employ a time-invariance objective to ensure consistency across diffusion timesteps, enhancing the stability of learned representations. Experimental results demonstrate that our method achieves state-of-the-art performance in terms of text-fidelity, and better preserves structural and semantic consistency compared to existing approaches.
[2-14] DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models (NAACL 2025)
배수영(박사과정)
While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose *DeCAP*, a method for debiasing LLMs using Context-Adaptive Prompt Generation. *DeCAP* leverages a *Question Ambiguity Detection* to take appropriate debiasing actions based on the context and a *Neutral Answer Guidance Generation* to lead the LLMs to make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that *DeCAP* achieves state-of-the-art zero-shot debiased QA performance. This demonstrates *DeCAP*'s efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.
[2-15] LAMP: Implicit Language Map for Robot Navigation (ICRA 2026)
이시백(석박통합과정)
Recent advances in vision-language models have made zero-shot navigation feasible, enabling robots to interpret and follow natural language instructions without requiring labeling. However, existing methods that explicitly store language vectors in grid or node-based maps struggle to scale to large environments due to excessive memory requirements and limited resolution for fine-grained planning. In this work, we introduce LAMP, a novel neural language field-based navigation framework that learns a continuous, language-driven map and directly leverages it for fine-grained path generation. Unlike prior approaches, our method encodes language features as an implicit neural field rather than storing them explicitly at every location. By combining this implicit representation with a sparse graph, LAMP supports efficient coarse path planning and then performs gradient-based optimization in the learned field to refine poses near the goal. Our two-stage pipeline of coarse graph search followed by language-driven, gradient-guided optimization is the first application of an implicit language map for precise path generation. This refinement mechanism is particularly effective at selecting goal regions even though they were not directly observed, as they exhibit similar semantic characteristics in the learned language feature space. To further enhance robustness, we adopt a Bayesian framework that models embedding uncertainty via the von Mises–Fisher distribution, thereby improving generalization to unobserved regions. Moreover, to scale to large environments, LAMP employs a graph sampling strategy that prioritizes spatial coverage and embedding confidence, retaining only the most informative nodes and substantially reducing computational overhead. 
Experiments in NVIDIA Isaac Sim and on a real multi-floor building demonstrate that LAMP outperforms existing explicit methods in both memory efficiency and fine-grained goal-reaching accuracy, opening new possibilities for scalable, language-driven robot navigation.
[2-16] HiDF: A Human-Indistinguishable Deepfake Dataset (KDD 2025)
강채원(석박통합과정)
The rapid development and prevalence of generative AI have made it easy for people to create high-quality deepfake images and videos, but their abuses have also increased exponentially. To mitigate potential social disruption, it is crucial to quickly detect the authenticity of each deepfake content hidden in a sea of information. While researchers have worked on developing deep learning-based methods, the deepfake datasets utilized in these studies are far from the real world in terms of their qualities; most popular deepfake datasets are human-distinguishable. To address this problem, we present a novel deepfake dataset, HiDF, a high-quality and human-indistinguishable deepfake dataset consisting of 62K images and 8K videos. HiDF is a meticulously curated dataset that includes diverse subjects that have undergone rigorous quality checks. A comparison of the quality between HiDF and existing deepfake datasets demonstrates that HiDF is human-indistinguishable. Hence, it can be a valuable benchmark dataset for deepfake detection tasks. Data and code (https://github.com/DSAIL-SKKU/HiDF) are publicly available for future deepfake detection research.
[2-17] MSQ: Memory-Efficient Bit Sparsification Quantization (ICCV 2025)
한석호(학부생)
As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies utilizing bit-level sparsity have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer to enable differentiable computation of the least significant bits (LSBs) from model weights. It further employs regularization to induce sparsity in these LSBs, enabling effective precision reduction without explicit bit-level parameter splitting. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ achieves up to 8.00× reduction in trainable parameters and up to 86% reduction in training time compared to previous bit-level quantization, while maintaining competitive accuracy and compression rates. This makes it a practical solution for training efficient DNNs on resource-constrained devices.
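The idea of computing LSBs directly from the weights, rather than splitting them into explicit bit-level parameters, can be sketched as below. This is a stand-in decomposition (w_int = 2·msb + lsb) for illustration, not MSQ's exact round-clamp math:

```python
import numpy as np

def int_quantize(w, bits):
    """Symmetric uniform quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax), scale

def lsb_split(w_int):
    """Split integer weights into higher bits and the least significant bit:
    w_int = 2 * msb + lsb, with lsb in {-1, 0, +1}. Regularizing lsb toward
    zero lets the layer drop to one bit less of precision."""
    msb = np.round(w_int / 2.0)        # weight expressed at one fewer bit
    lsb = w_int - 2.0 * msb
    return msb, lsb

w = np.array([-0.9, -0.3, 0.1, 0.4, 0.8])
w_int, scale = int_quantize(w, bits=4)
msb, lsb = lsb_split(w_int)
```

When sparsity regularization drives a layer's `lsb` terms to zero, the layer is exactly representable at the lower precision, which is how bit-level sparsity translates into mixed-precision assignments.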
[2-18] NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning (NeurIPS 2025)
최원제(박사과정)
We address the challenge of adopting language models (LMs) for embodied tasks in dynamic environments, where online access to large-scale inference engines or symbolic planners is constrained due to latency, connectivity, and resource limitations. To this end, we present NeSyPr, a novel embodied reasoning framework that compiles knowledge via neurosymbolic proceduralization, thereby equipping LM-based agents with structured, adaptive, and timely reasoning capabilities. In NeSyPr, task-specific plans are first explicitly generated by a symbolic tool leveraging its declarative knowledge. These plans are then transformed into composable procedural representations that encode the plans' implicit production rules, enabling the resulting composed procedures to be seamlessly integrated into the LM's inference process. This neurosymbolic proceduralization abstracts and generalizes multi-step symbolic structured path-finding and reasoning into single-step LM inference, akin to human knowledge compilation. It supports efficient test-time inference without relying on external symbolic guidance, making it well suited for deployment in latency-sensitive and resource-constrained physical systems. We evaluate NeSyPr on the embodied benchmarks PDDLGym, VirtualHome, and ALFWorld, demonstrating its efficient reasoning capabilities over large-scale reasoning models and a symbolic planner, while using more compact LMs.
[2-19] Auto-Encoded Supervision for Perceptual Image Super-Resolution (CVPR 2025)
이민규(박사과정)
This work tackles the fidelity objective in the perceptual super-resolution (SR) task. Specifically, we address the shortcomings of pixel-level L_p loss (L_pix) in the GAN-based SR framework. Since L_pix is known to have a trade-off relationship against perceptual quality, prior methods often multiply a small scale factor or utilize low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of L_pix that contributes to blurring, and 2) only guiding based on the factor that is free from this trade-off relationship. We show that they can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with L_pix. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss (L_AESOP), a novel loss function that measures distance in the AE space (AE space indicates the space after the decoder, not the bottleneck), instead of the raw pixel space. By simply substituting L_pix with L_AESOP, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Extensive experiments demonstrate the effectiveness of AESOP.
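The substitution of the raw pixel space by the AE's decoded space can be sketched with a toy linear auto-encoder. The orthonormal-projection "AE" below is a hypothetical stand-in for the L_pix-pretrained auto-encoder:

```python
import numpy as np

def ae_space_loss(sr, hr, encode, decode):
    """AESOP-style loss sketch: L1 distance measured after passing both the
    super-resolved and ground-truth images through a pretrained auto-encoder
    (distance in the decoded AE space, not raw pixel space)."""
    return float(np.mean(np.abs(decode(encode(sr)) - decode(encode(hr)))))

# Toy linear "auto-encoder": project onto an 8-dim orthonormal basis.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
B = Q[:, :8]                       # bottleneck basis
encode = lambda x: x @ B           # 64 -> 8
decode = lambda z: z @ B.T         # 8 -> 64

hr = rng.normal(size=(16, 64))                      # ground-truth patches
sr_good = hr + 0.01 * rng.normal(size=(16, 64))     # accurate reconstruction
sr_bad = hr + 0.5 * rng.normal(size=(16, 64))       # inaccurate reconstruction
```

The loss still penalizes reconstruction errors (worse outputs score higher) while components the AE discards, which in the real setting include the blur-inducing subcomponent of L_pix, no longer contribute to the gradient.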
[2-20] Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning (NeurIPS 2025)
안상현(석사과정)
Recent advances in large language models (LLMs) have enabled the automatic generation of executable code for task planning and control in embodied agents such as robots, demonstrating the potential of LLM-based embodied intelligence. However, these LLM-based code-as-policies approaches often suffer from limited environmental grounding, particularly in dynamic or partially observable settings, leading to suboptimal task success rates due to incorrect or incomplete code generation. In this work, we propose a neuro-symbolic embodied task planning framework that incorporates explicit symbolic verification and interactive validation processes during code generation. In the validation phase, the framework generates exploratory code that actively interacts with the environment to acquire missing observations while preserving task-relevant states. This integrated process enhances the grounding of generated code, resulting in improved task reliability and success rates in complex environments. We evaluate our framework on RLBench and in real-world settings across dynamic, partially observable scenarios. Experimental results demonstrate that our framework improves task success rates by 46.2% over Code as Policies baselines and attains over 86.8% executability of task-relevant actions, thereby enhancing the reliability of task planning in dynamic environments.