SiLVR: A Simple Language-based Video Reasoning Framework
Ce Zhang*, Yan-Bo Lin*, Ziyang Wang, Mohit Bansal, Gedas Bertasius
UNC Chapel Hill
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
Large Language Models (LLMs) excel at complex reasoning, but multimodal LLMs (MLLMs) still lag behind on complex video-language tasks.
Recent MLLMs have explored RL-based reasoning frameworks, but they often rely on task-specific rewards, leading to poor generalization. Additionally, these RL-based methods are resource-intensive and may underperform compared to the supervised fine-tuning (SFT) baselines.
We want to build a simple, modular, training-free, yet highly performant framework for complex video-language reasoning tasks.
Our method decomposes video-language QA into two stages:
In the first stage, we convert raw videos into rich language-based descriptions. Specifically, we densely sample short clips from the input videos and use a pre-trained visual captioner (e.g., NVILA) to extract captions for each clip. Additionally, we use automatic speech recognition (ASR) tools to convert speech into language descriptions.
In the second stage, we feed the rich language descriptions into a strong reasoning LLM (e.g., DeepSeek-R1) to solve complex video-language understanding tasks.
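Below is a minimal Python sketch of this two-stage pipeline. The helpers caption_clip, transcribe_speech, and reasoning_llm are hypothetical stand-ins for the visual captioner (e.g., NVILA), the ASR tool, and the reasoning LLM (e.g., DeepSeek-R1); the actual implementation is available in the linked code repository.

# Illustrative sketch of SiLVR's two-stage pipeline (not the official implementation).
# caption_clip, transcribe_speech, and reasoning_llm are hypothetical stand-ins
# for the visual captioner, the ASR tool, and the reasoning LLM.

def silvr_answer(video, question, clip_len_s=8):
    # Stage 1: convert the raw video into language-based representations.
    captions = []
    for start in range(0, int(video.duration_s), clip_len_s):
        clip = video.slice(start, start + clip_len_s)
        captions.append(f"[{start}s-{start + clip_len_s}s] {caption_clip(clip)}")
    transcript = transcribe_speech(video.audio)  # speech -> text via ASR

    # Stage 2: feed the language descriptions to a strong reasoning LLM.
    prompt = (
        "Video clip captions:\n" + "\n".join(captions)
        + "\n\nSpeech transcript:\n" + transcript
        + "\n\nQuestion: " + question
    )
    return reasoning_llm(prompt)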
Such a decomposed video reasoning design offers several benefits: 1) Simplicity: SiLVR does not require complex RL-based optimization or specialized modules for different tasks. 2) Generalizability: our method can be applied to a wide range of complex video-language tasks without task-specific fine-tuning. 3) Modularity: our method’s modular design enables seamless use of powerful visual captioning models and strong reasoning LLMs. 4) Flexibility: SiLVR supports plug-and-play integration of different captioning models, speech recognition models, and LLMs.
Unlike prior video reasoning approaches, SiLVR performs reasoning entirely in the language space. However, the limited context window of LLMs poses a significant challenge when processing long videos with rich multimodal content.
To address this issue, we introduce a simple adaptive token reduction scheme. Our adaptive token reduction scheme dynamically adjusts the temporal granularity for sampling video tokens. Specifically, it starts with a small clip length and progressively increases it to reduce the total number of generated tokens.
This allows us to effectively fit the input tokens within the LLM’s context window for videos of varying durations while maintaining strong video reasoning performance.
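A minimal sketch of one way such an adaptive scheme can be implemented is shown below, reusing the hypothetical helpers from the earlier sketch plus an assumed count_tokens function; the starting clip length, the doubling schedule, and the token budget handling are assumptions rather than the exact settings used in the paper.

def adaptive_captions(video, token_budget, max_clip_len_s=64):
    # Start with a short clip length (fine temporal granularity) and double it
    # until the captions plus the speech transcript fit within the LLM's
    # context budget. Illustrative sketch; in practice neighboring fine-grained
    # captions could be merged instead of re-captioning at each step.
    transcript = transcribe_speech(video.audio)
    clip_len_s = 1
    while True:
        captions = [
            caption_clip(video.slice(t, t + clip_len_s))
            for t in range(0, int(video.duration_s), clip_len_s)
        ]
        total = count_tokens("\n".join(captions)) + count_tokens(transcript)
        if total <= token_budget or clip_len_s >= max_clip_len_s:
            return captions, transcript
        clip_len_s *= 2  # coarser granularity -> fewer caption tokens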
We conduct experiments on eight complex video-language understanding benchmarks: Video-MMMU, Video-MMLU, MMVU, MMWorld, Video-MME, CGBench, EgoLife and CinePile.
We group these benchmarks into two categories: Reasoning Benchmarks and General Benchmarks.
Reasoning Benchmarks: Video-MMMU, Video-MMLU, MMVU, and MMWorld, which primarily evaluate the reasoning capabilities of large video-language models.
General Benchmarks: Video-MME, CGBench, EgoLife, and CinePile, which contain various types of questions and offer a comprehensive assessment of the video-language models.
We use the comprehension split of Video-MMMU and the long split of Video-MME (with subtitles). SiLVR achieves the best-reported results on Video-MMMU (comprehension), Video-MMLU, Video-MME (long split, with subtitles), CGBench, and EgoLife, outperforming strong proprietary models such as Gemini 2.0 and GPT-4o.
We also evaluate our method on two additional tasks:
Knowledge Acquisition on Video-MMMU
Temporally Grounded QA on CGBench
SiLVR achieves state-of-the-art performance on both tasks, showing strong generalizability.
To study the impact of a strong reasoning LLM within our framework, we compare the performance of our method when using a reasoning LLM (DeepSeek-R1) vs. a non-reasoning LLM (Llama 4). We observe that:
DeepSeek-R1 consistently outperforms Llama 4 across all benchmarks, indicating that it is overall a stronger LLM.
DeepSeek-R1 leads to much larger performance gains on the reasoning benchmarks. In contrast, while DeepSeek-R1 also produces better results on general video benchmarks, the improvements over Llama 4 are much smaller.
These results suggest that the strong reasoning ability of DeepSeek-R1 is critical for solving complex video reasoning tasks and that our framework’s simple and modular design allows us to take full advantage of DeepSeek-R1’s strong reasoning abilities on these complex video reasoning problems.
We report the performance gains of using a reasoning LLM (DeepSeek-R1) over a non-reasoning LLM (Llama 4) for different question categories on Video-MME.
We observe that DeepSeek-R1 yields a significantly larger improvement over Llama 4 on reasoning questions (a gain of +11.1%) than on non-reasoning questions (a gain of +4.9%).
This result is consistent with our observations in the prior table, which confirms that reasoning LLMs bring greater benefits for tasks that require complex reasoning.
To evaluate the relative contribution of visual and audio information, we vary the fraction of tokens from speech transcripts and video captions and report the QA performance on Video-MME.
From the table, we observe that speech tokens are more informative than visual caption tokens.
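The sketch below shows one simple way to run this kind of ablation: keep only a fixed fraction of each modality's tokens before building the LLM prompt. The whitespace tokenization and the prompt format are assumptions for illustration, not the exact procedure used in the paper.

def build_ablation_prompt(captions, transcript, question,
                          caption_frac=1.0, speech_frac=1.0):
    # Keep only the first fraction of tokens from each modality
    # (crude whitespace tokenization, used here purely for illustration).
    def keep_fraction(text, frac):
        tokens = text.split()
        return " ".join(tokens[: int(len(tokens) * frac)])

    return (
        "Video clip captions:\n" + keep_fraction("\n".join(captions), caption_frac)
        + "\n\nSpeech transcript:\n" + keep_fraction(transcript, speech_frac)
        + "\n\nQuestion: " + question
    )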
We compare Adaptive Token Reduction with several static baselines that use fixed video clip lengths.
Static Baselines. Among all static baselines, the variant that uses an 8-second clip length achieves the highest accuracy of 74.2%. We note that shorter-clip variants (e.g., 1s) generate a large number of captions for long videos, which often exceeds the context window of the LLM, thus leading to degraded performance. In contrast, longer-clip variants (e.g., 64s) reduce the number of captions at the cost of sacrificing the granularity of visual information, which also leads to lower accuracy.
Adaptive Token Reduction effectively reduces redundant tokens by adaptively adjusting the clip length, consistently outperforming all fixed clip length baselines.
Comparing visual captioners, Qwen-2.5-VL 72B achieves the best overall accuracy, possibly because of its larger model size, while NVILA 7B and Qwen-2.5-VL 7B provide the best accuracy-cost trade-off.
Comparing reasoning LLMs, DeepSeek-R1 achieves the highest overall accuracy, while Llama-4 Maverick achieves 66.2% overall accuracy, providing an effective trade-off between model size and performance.
@article{zhang2025silvr,
title={SiLVR: A Simple Language-based Video Reasoning Framework},
author={Zhang, Ce and Lin, Yan-Bo and Wang, Ziyang and Bansal, Mohit and Bertasius, Gedas},
year={2025},
journal={arXiv preprint arXiv:2505.24869},
}