BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang,
Gedas Bertasius, Lorenzo Torresani
Meta AI and UNC Chapel Hill
Accepted to CVPR 2025
Our proposed BIMBA uses a Mamba-based Spatiotemporal Token Selector to select a small set of salient tokens from the long sequence of features extracted by a pretrained image encoder. Token selection can optionally be conditioned on the textual query to identify the features most informative for answering the given question. Finally, the selected and transformed tokens are passed, together with a tokenized version of the input question, to a large language model that generates the answer.
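Below is a minimal PyTorch sketch of this flow. The pretrained image encoder, the Mamba-based Spatiotemporal Token Selector, and the LLM are replaced by small stand-in modules; all names, shapes, and sizes are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn as nn


class BimbaSketch(nn.Module):
    def __init__(self, dim=64, num_selected=8, vocab_size=1000):
        super().__init__()
        self.frame_proj = nn.Linear(768, dim)        # stand-in for a pretrained ViT encoder
        self.query_proj = nn.Linear(dim, dim)        # projects text-query embeddings for conditioning
        self.selector = nn.GRU(dim, dim, batch_first=True)  # stand-in for the Mamba-based selector
        self.num_selected = num_selected
        self.llm_head = nn.Linear(dim, vocab_size)   # stand-in for the LLM decoder

    def forward(self, frame_feats, question_emb):
        # frame_feats: (B, T, N, 768) patch features for T frames with N patches each
        # question_emb: (B, Q, dim) embedded question tokens used for conditioning
        tokens = self.frame_proj(frame_feats).flatten(1, 2)          # (B, T*N, dim)
        # Optionally condition selection on the question by prepending its embedding.
        seq = torch.cat([self.query_proj(question_emb), tokens], dim=1)
        scanned, _ = self.selector(seq)
        # Crude compression for this sketch: keep the last few scan states.
        # The actual selection mechanism (query interleaving + bidirectional
        # scan) is detailed in the next figure.
        selected = scanned[:, -self.num_selected:, :]                # (B, num_selected, dim)
        # The selected tokens are concatenated with the tokenized question and
        # fed to the LLM; a linear head stands in for the decoder here.
        return self.llm_head(selected)


if __name__ == "__main__":
    model = BimbaSketch()
    out = model(torch.randn(1, 16, 49, 768), torch.randn(1, 4, 64))
    print(out.shape)  # torch.Size([1, 8, 1000])
```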
(a) Architecture of our Spatiotemporal Token Selector. (b) A traditional selective scan with queries appended at the start or end of the sequence introduces positional biases that often lead to suboptimal performance. (c) We instead interleave the queries uniformly across the sequence to capture interactions with spatiotemporal tokens throughout the video more evenly. (d) We further introduce a bidirectional selective scan (forward and backward) to improve long-range modeling.
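To make the query-interleaving and bidirectional-scan idea concrete, here is a minimal PyTorch sketch. An nn.GRU is used as a stand-in for the Mamba selective-scan block, and the module names and hyperparameters are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class BidirectionalScanSelector(nn.Module):
    def __init__(self, dim: int, num_queries: int):
        super().__init__()
        self.num_queries = num_queries
        # Learnable query tokens that will be interleaved with video tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Stand-ins for the forward and backward selective-scan (Mamba) blocks.
        self.fwd_scan = nn.GRU(dim, dim, batch_first=True)
        self.bwd_scan = nn.GRU(dim, dim, batch_first=True)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, L, D) flattened spatiotemporal tokens.
        # Assumes L is much larger than num_queries.
        B, _, D = video_tokens.shape

        # 1) Interleave queries uniformly: split the sequence into num_queries
        #    chunks and append one query token after each chunk.
        chunks = torch.chunk(video_tokens, self.num_queries, dim=1)
        pieces, query_pos, offset = [], [], 0
        for i, chunk in enumerate(chunks):
            pieces.append(chunk)
            pieces.append(self.queries[i].expand(B, 1, D))
            offset += chunk.shape[1]
            query_pos.append(offset)     # index of the i-th query in the sequence
            offset += 1
        seq = torch.cat(pieces, dim=1)   # (B, L + num_queries, D)

        # 2) Bidirectional scan: a forward pass plus a pass over the reversed
        #    sequence, flipped back and summed.
        fwd, _ = self.fwd_scan(seq)
        bwd, _ = self.bwd_scan(seq.flip(dims=[1]))
        fused = fwd + bwd.flip(dims=[1])

        # 3) Read out only the query positions as the compressed tokens.
        idx = torch.tensor(query_pos, device=seq.device)
        return fused[:, idx, :]          # (B, num_queries, D)


if __name__ == "__main__":
    selector = BidirectionalScanSelector(dim=64, num_queries=8)
    tokens = torch.randn(2, 256, 64)     # 2 videos, 256 spatiotemporal tokens each
    print(selector(tokens).shape)        # torch.Size([2, 8, 64])
```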
We compare BIMBA with state-of-the-art video MLLMs across seven diverse video question-answering benchmarks. BIMBA-LLaVA achieves the highest performance on all datasets when using the Qwen2-7B LLM backbone (third section).
On the EgoSchema benchmark, our model surpasses the previous best method, LongVU, by 3.54%, demonstrating its superior ability to comprehend egocentric videos and handle questions requiring long-context understanding.
On VNBench, which focuses on needle-in-a-haystack questions, our approach outperforms LLaVA-Video by 7.11%, highlighting its strong capability to extract key information from very long videos.
Furthermore, on benchmarks requiring long video comprehension, such as LongVideoBench, Video-MME, and MLVU, our model sets a new state-of-the-art, further demonstrating its effectiveness in processing and understanding hour-long videos.
Lastly, since different MLLMs leverage varying LLM backbones and training data, we also conduct a fair comparison by evaluating our model against four baselines trained on the same 370K instruction-tuning dataset, using Vicuna-7B and LLaMA3.2-8B LLM decoders (second section).
In addition to prior methods, we evaluate our model against four baselines to analyze the effectiveness of the proposed selective-scan compression technique.
Vanilla: Removes the spatiotemporal token selector from our model, resulting in no token compression.
Pooling: Uses spatiotemporal pooling for compression, matching our model's compression ratio (a minimal sketch follows this list).
Self-Attention: Replaces the selective-scan layers of our model with self-attention layers.
Perceiver: Adopts the widely used Perceiver mechanism for token compression at the same ratio as our model.
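For reference, here is a minimal sketch of the Pooling baseline; the tensor shapes and pooling targets are illustrative assumptions chosen only to show how a fixed compression ratio can be matched by average pooling over the spatiotemporal grid.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 64, 16, 16, 768)     # (B, T, H, W, D) frame patch features
pool = nn.AdaptiveAvgPool3d((8, 4, 4))       # target (T', H', W') chosen to match the compression ratio

# Pool over the (T, H, W) axes, keeping the feature dimension D intact.
x = tokens.permute(0, 4, 1, 2, 3)            # (B, D, T, H, W)
compressed = pool(x).permute(0, 2, 3, 4, 1)  # (B, 8, 4, 4, D)
compressed = compressed.flatten(1, 3)        # (B, 128, D) tokens passed to the LLM
print(compressed.shape)                      # torch.Size([1, 128, 768])
```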
Here, we present the accuracy achieved by BIMBA-LLaVA (Vicuna-7B) and baseline models on NeXT-QA (left) and EgoSchema (right) as a function of the number of input tokens.
BIMBA achieves the highest accuracy at all sequence lengths, and the gap over the other baselines widens as the number of input tokens grows.
Self-attention and Vanilla cannot be applied to long sequences as they cause GPU out-of-memory issues once the number of tokens becomes too large.
BIMBA also outperforms the Pooling and the Perceiver baselines in all scenarios, demonstrating its superior effectiveness.
This figure shows the computation costs of BIMBA-LLaVA (Vicuna-7B) and baseline models in terms of memory usage (left) and runtime (right).
The Vanilla and Self-Attention baselines quickly run out of memory as the number of input tokens increases. BIMBA, Perceiver, and Pooling all maintain low memory and runtime costs, but our method achieves the highest accuracy across all input lengths, as shown in the previous section.
BIMBA also excels at answering open-ended video questions. The examples showcase the model's ability to handle diverse video understanding tasks, including generating detailed descriptions, recognizing objects and interactions, identifying fine-grained activities, and inferring high-level goals. This illustrates the model's effectiveness in general-purpose video understanding.
@article{islam2025bimba,
title={BIMBA: Selective-Scan Compression for Long-Range Video Question Answering},
author={Islam, Md Mohaiminul and Nagarajan, Tushar and Wang, Huiyu and Bertasius, Gedas and Torresani, Lorenzo},
journal={arXiv preprint arXiv:2503.09590},
year={2025}
}