Decomposed Attention Fusion in MLLMs for
Training-Free Video Reasoning Segmentation
Su Ho Han1* Jeongseok Hyun1* Pilhyeon Lee2 Minho Shim3 Dongyoon Wee3 Seon Joo Kim1
1Yonsei University 2Inha University 3NAVER Cloud
(*Equal contribution)
Overview video of our method.
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To adapt this capability to localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via the attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. DecAF suppresses irrelevant activations and enhances object-focused cues, enabling attention maps to be converted directly into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
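To make the last step above concrete, the sketch below shows one way a refined attention map could be turned directly into a coarse binary mask. The min-max normalization and the threshold `tau` are illustrative assumptions, not values from the paper.

```python
import torch

def attention_to_coarse_mask(attn_map: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Convert a refined attention map of shape (H, W) into a coarse binary mask.

    The min-max normalization and the threshold `tau` are illustrative choices;
    the paper's exact conversion rule may differ.
    """
    attn = attn_map.float()
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-6)  # normalize to [0, 1]
    return attn > tau  # boolean mask: True where attention exceeds the threshold
```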
Visualization of our method. (a) Noise in irrelevant regions is suppressed by contrastive fusion with the background attention map. As shown in the first frame, background activations are removed and the target object is emphasized. (b) The video attention map captures temporal cues, while the frame attention map highlights object-centric details. Their fusion resolves conflicts (e.g., distinguishing the server from the hitting player) and produces more consistent localization. The attention mask is obtained directly from the attention map, while the SAM mask is generated by SAM2.
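A minimal per-frame sketch of the two fusion steps described above is given below. It assumes contrastive fusion can be realized as a weighted subtraction of the background map and complementary fusion as an elementwise combination of video- and frame-level maps; both the fusion rules and the weight `alpha` are assumptions, not the paper's exact formulation.

```python
import torch

def contrastive_fusion(obj_attn: torch.Tensor, bg_attn: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Suppress background activations by contrasting object and background attention maps.

    A weighted subtraction is one plausible realization; `alpha` is illustrative.
    """
    fused = obj_attn - alpha * bg_attn
    return fused.clamp(min=0.0)  # keep only regions where the object map dominates

def complementary_fusion(video_attn: torch.Tensor, frame_attn: torch.Tensor) -> torch.Tensor:
    """Combine temporally aware video-level cues with object-centric frame-level cues.

    An elementwise product keeps regions supported by both maps; the paper's
    exact fusion rule may differ.
    """
    return video_attn * frame_attn
```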
Overview of DecAF. (a) Attention rollout with our V-Max normalization produces a rollout matrix that accumulates attention across layers, from which visual-token scores for the final query token are extracted as attention maps for grounding. (b) Contrastive fusion suppresses attention scores on background regions. (c) Complementary fusion integrates video- and frame-level cues. (d) These fusion methods are combined into the full pipeline to refine noisy attention maps.
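For reference, a minimal sketch of accumulating attention across layers and reading off the final query token's scores over visual tokens is shown below. It uses the standard residual-averaging rollout as a stand-in; the paper's V-Max normalization is not reproduced here, and `query_idx` / `visual_idx` are assumed token positions.

```python
import torch

def attention_rollout(attn_layers: list[torch.Tensor]) -> torch.Tensor:
    """Accumulate attention across layers via standard rollout.

    Each element of `attn_layers` is a head-averaged attention matrix of shape
    (T, T). This is the common residual-averaging variant, not the paper's
    V-Max normalization.
    """
    num_tokens = attn_layers[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attn_layers:
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout                         # propagate through the layer
    return rollout

def query_to_visual_scores(rollout: torch.Tensor, visual_idx: torch.Tensor, query_idx: int) -> torch.Tensor:
    """Read off the final query token's accumulated attention over visual tokens."""
    return rollout[query_idx, visual_idx]
```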
Overview of our SAM prompting pipeline with attention maps. (1) Point queries for SAM2 are obtained from attention maps via thresholding ($\tau_{pq}$). (2) During mask propagation, highly overlapping masks are removed. (3) Spurious mask tracklets are removed using our scoring method.
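A minimal sketch of step (1), extracting point prompts from an attention map by thresholding at $\tau_{pq}$, follows. The min-max normalization and the `max_points` cap are illustrative assumptions; the resulting (x, y) points would then be fed to SAM2 as point prompts.

```python
import torch

def attention_to_point_prompts(attn_map: torch.Tensor, tau_pq: float = 0.8, max_points: int = 5) -> torch.Tensor:
    """Extract (x, y) point prompts from an attention map by thresholding at tau_pq.

    The normalization and the cap on the number of points are illustrative;
    the paper's exact selection rule may differ.
    """
    attn = attn_map.float()
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-6)  # normalize to [0, 1]
    ys, xs = torch.nonzero(attn > tau_pq, as_tuple=True)           # locations above the threshold
    if ys.numel() == 0:
        return torch.empty(0, 2)
    scores = attn[ys, xs]
    top = scores.argsort(descending=True)[:max_points]             # keep the highest-scoring points
    return torch.stack([xs[top], ys[top]], dim=1)                  # (N, 2) points in (x, y) order
```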
Comparison of MLLM-based text-conditioned VOS methods that directly compute masks from attention maps (Attn Mask). All methods are training-free and grouped by MLLM.
Comparison of MLLM-based text-conditioned VOS methods. The upper gray rows correspond to training-based methods, while the lower colored rows correspond to training-free methods.
Comparison on additional datasets.
Ablation Study.
Qualitative results for the single object case.
Qualitative results for the multiple object case.
Qualitative results for the small object case.
Qualitative results for the temporal reasoning case.
Qualitative results for the world knowledge case.
Will be updated soon!