BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang,
Gedas Bertasius, Lorenzo Torresani
Meta AI and UNC Chapel Hill
Accepted to CVPR 2025
Our proposed BIMBA uses a Mamba-based Spatiotemporal Token Selector to select a small set of salient tokens from the long sequence of features extracted by a pretrained image encoder. Token selection can optionally be conditioned on the textual query to identify the features most informative for answering the given question. Finally, the selected and transformed tokens are passed, together with a tokenized version of the input question, to a large language model that generates the answer.
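Below is a minimal PyTorch sketch of this flow. The pretrained image encoder, the Mamba-based Spatiotemporal Token Selector, and the LLM are replaced by small stand-in modules; all names, shapes, and sizes are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn as nn


class BimbaSketch(nn.Module):
    def __init__(self, dim=64, num_selected=8, vocab_size=1000):
        super().__init__()
        self.frame_proj = nn.Linear(768, dim)        # stand-in for a pretrained ViT encoder
        self.query_proj = nn.Linear(dim, dim)        # projects text-query embeddings for conditioning
        self.selector = nn.GRU(dim, dim, batch_first=True)  # stand-in for the Mamba-based selector
        self.num_selected = num_selected
        self.llm_head = nn.Linear(dim, vocab_size)   # stand-in for the LLM decoder

    def forward(self, frame_feats, question_emb):
        # frame_feats: (B, T, N, 768) patch features for T frames with N patches each
        # question_emb: (B, Q, dim) embedded question tokens used for conditioning
        tokens = self.frame_proj(frame_feats).flatten(1, 2)          # (B, T*N, dim)
        # Optionally condition selection on the question by prepending its embedding.
        seq = torch.cat([self.query_proj(question_emb), tokens], dim=1)
        scanned, _ = self.selector(seq)
        # Crude compression for this sketch: keep the last few scan states.
        # The actual selection mechanism (query interleaving + bidirectional
        # scan) is detailed in the next figure.
        selected = scanned[:, -self.num_selected:, :]                # (B, num_selected, dim)
        # The selected tokens are concatenated with the tokenized question and
        # fed to the LLM; a linear head stands in for the decoder here.
        return self.llm_head(selected)


if __name__ == "__main__":
    model = BimbaSketch()
    out = model(torch.randn(1, 16, 49, 768), torch.randn(1, 4, 64))
    print(out.shape)  # torch.Size([1, 8, 1000])
```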
(a) Architecture of our Spatiotemporal Token Selector. (b) A traditional selective scan with queries appended at the start or end of the sequence introduces positional biases that often lead to suboptimal performance. (c) We instead interleave the queries uniformly across the sequence to capture interactions with spatiotemporal tokens throughout the video more evenly. (d) We further introduce a bidirectional selective scan (forward and backward) to improve long-range modeling.
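To make the query-interleaving and bidirectional-scan idea concrete, here is a minimal PyTorch sketch. An nn.GRU is used as a stand-in for the Mamba selective-scan block, and the module names and hyperparameters are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class BidirectionalScanSelector(nn.Module):
    def __init__(self, dim: int, num_queries: int):
        super().__init__()
        self.num_queries = num_queries
        # Learnable query tokens that will be interleaved with video tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Stand-ins for the forward and backward selective-scan (Mamba) blocks.
        self.fwd_scan = nn.GRU(dim, dim, batch_first=True)
        self.bwd_scan = nn.GRU(dim, dim, batch_first=True)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, L, D) flattened spatiotemporal tokens.
        # Assumes L is much larger than num_queries.
        B, _, D = video_tokens.shape

        # 1) Interleave queries uniformly: split the sequence into num_queries
        #    chunks and append one query token after each chunk.
        chunks = torch.chunk(video_tokens, self.num_queries, dim=1)
        pieces, query_pos, offset = [], [], 0
        for i, chunk in enumerate(chunks):
            pieces.append(chunk)
            pieces.append(self.queries[i].expand(B, 1, D))
            offset += chunk.shape[1]
            query_pos.append(offset)     # index of the i-th query in the sequence
            offset += 1
        seq = torch.cat(pieces, dim=1)   # (B, L + num_queries, D)

        # 2) Bidirectional scan: a forward pass plus a pass over the reversed
        #    sequence, flipped back and summed.
        fwd, _ = self.fwd_scan(seq)
        bwd, _ = self.bwd_scan(seq.flip(dims=[1]))
        fused = fwd + bwd.flip(dims=[1])

        # 3) Read out only the query positions as the compressed tokens.
        idx = torch.tensor(query_pos, device=seq.device)
        return fused[:, idx, :]          # (B, num_queries, D)


if __name__ == "__main__":
    selector = BidirectionalScanSelector(dim=64, num_queries=8)
    tokens = torch.randn(2, 256, 64)     # 2 videos, 256 spatiotemporal tokens each
    print(selector(tokens).shape)        # torch.Size([2, 8, 64])
```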
We compare BIMBA with state-of-the-art video MLLMs across seven diverse video question-answering benchmarks. BIMBA-LLaVA achieves the highest performance on all datasets when using the Qwen2-7B LLM backbone (third section).
On the EgoSchema benchmark, our model surpasses the previous best method, LongVU, by 3.54%, demonstrating its superior ability to comprehend egocentric videos and handle questions requiring long-context understanding.
On VNBench, which focuses on needle-in-a-haystack questions, our approach outperforms LLaVA-Video by 7.11%, highlighting its strong capability to extract key information from very long videos.
Furthermore, on benchmarks requiring long video comprehension, such as LongVideoBench, Video-MME, and MLVU, our model sets a new state-of-the-art, further demonstrating its effectiveness in processing and understanding hour-long videos.
Lastly, since different MLLMs leverage varying LLM backbones and training data, we also conduct a fair comparison by evaluating our model against four baselines trained on the same 370K instruction-tuning dataset, using Vicuna-7B and LLaMA3.2-8B LLM decoders (second section).
In addition to prior methods, we evaluate our model against four baselines to analyze the effectiveness of the proposed selective-scan compression technique.
Vanilla: Removes the spatiotemporal token selector from our model, resulting in no token compression.
Pooling: Uses spatiotemporal pooling for compression, matching our model's compression ratio (a minimal sketch follows this list).
Self-Attention: Replaces the selective-scan layers of our model with self-attention layers.
Perceiver: Adopts the widely used Perceiver mechanism for token compression at the same ratio as our model.
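For reference, here is a minimal sketch of the Pooling baseline; the tensor shapes and pooling targets are illustrative assumptions chosen only to show how a fixed compression ratio can be matched by average pooling over the spatiotemporal grid.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 64, 16, 16, 768)     # (B, T, H, W, D) frame patch features
pool = nn.AdaptiveAvgPool3d((8, 4, 4))       # target (T', H', W') chosen to match the compression ratio

# Pool over the (T, H, W) axes, keeping the feature dimension D intact.
x = tokens.permute(0, 4, 1, 2, 3)            # (B, D, T, H, W)
compressed = pool(x).permute(0, 2, 3, 4, 1)  # (B, 8, 4, 4, D)
compressed = compressed.flatten(1, 3)        # (B, 128, D) tokens passed to the LLM
print(compressed.shape)                      # torch.Size([1, 128, 768])
```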
Here, we present the accuracy achieved by BIMBA-LLaVA (Vicuna-7B) and baseline models on NeXT-QA (left) and EgoSchema (right) as a function of the number of input tokens.
BIMBA achieves the highest accuracy at all sequence lengths, and the gap over the other baselines widens as the number of input tokens grows.
Self-attention and Vanilla cannot be applied to long sequences as they cause GPU out-of-memory issues once the number of tokens becomes too large.
BIMBA also outperforms the Pooling and the Perceiver baselines in all scenarios, demonstrating its superior effectiveness.
This figure shows the computation costs of BIMBA-LLaVA (Vicuna-7B) and baseline models in terms of memory usage (left) and runtime (right).
The Vanilla and Self-Attention baselines quickly run out of memory as the number of input tokens increases. BIMBA, Perceiver, and Pooling all maintain low memory and runtime costs, but our method achieves the highest accuracy across all input lengths, as shown in the previous section.
BIMBA also excels at answering open-ended video questions. The examples showcase the model's ability to handle diverse video understanding tasks, including generating detailed descriptions, recognizing objects and interactions, identifying fine-grained activities, and inferring high-level goals. This illustrates the model's effectiveness in general-purpose video understanding.
@article{islam2025bimba,
title={BIMBA: Selective-Scan Compression for Long-Range Video Question Answering},
author={Islam, Md Mohaiminul and Nagarajan, Tushar and Wang, Huiyu and Bertasius, Gedas and Torresani, Lorenzo},
journal={arXiv preprint arXiv:2503.09590},
year={2025}
}