Cobra

Extending Mamba to Multi-modal Large Language Model for Efficient Inference

Han Zhao¹﹐², Min Zhang¹, Wei Zhao¹, Pengxiang Ding¹, Siteng Huang¹, and Donglin Wang¹

¹ MiLAB, School of Engineering, Westlake University    ² Zhejiang University

[ArXiv]           [Code]           [Model]           [Demo]

Abstract

In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose attention mechanism incurs quadratic computational complexity. To improve the efficiency of such base models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modality fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves highly competitive performance against current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster thanks to its linear sequence modeling. (2) Interestingly, results on closed-set challenging prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the parameters. We hope the proposed method can facilitate future research on complexity problems in MLLMs.

Overview

We fuse DINOv2 and SigLIP as our vision backbone. The LLM backbone is a Mamba language model with 2.8B parameters. The projector is a simple learnable MLP that aligns the vision features with the text embedding space. During training, the parameters of the vision encoders are frozen, and we fine-tune the parameters of the projector and the Mamba LLM backbone.
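The wiring described above can be summarized with a minimal PyTorch-style sketch. The module names (`dino_encoder`, `siglip_encoder`, `mamba_lm`) and the call signatures are placeholders rather than the exact interfaces of the released code; the point is only the fusion of the two frozen vision encoders, the learnable MLP projector, and the Mamba backbone that consumes the concatenated visual and text tokens.

```python
import torch
import torch.nn as nn


class CobraSketch(nn.Module):
    """Minimal sketch of the architecture described above.

    `dino_encoder`, `siglip_encoder`, and `mamba_lm` stand in for the real
    DINOv2, SigLIP, and 2.8B-parameter Mamba backbones; only the wiring
    (frozen vision encoders -> learnable MLP projector -> Mamba LM) is shown.
    """

    def __init__(self, dino_encoder, siglip_encoder, mamba_lm,
                 dino_dim, siglip_dim, lm_dim):
        super().__init__()
        self.dino, self.siglip, self.mamba = dino_encoder, siglip_encoder, mamba_lm

        # Vision encoders stay frozen during training.
        for enc in (self.dino, self.siglip):
            for p in enc.parameters():
                p.requires_grad = False

        # Learnable MLP projector aligning fused vision features with the LM space.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, images, text_embeds):
        # Each encoder is assumed to return patch tokens of shape (B, N, D)
        # with the same number of tokens N, so features are fused channel-wise.
        with torch.no_grad():
            vision_feats = torch.cat([self.dino(images), self.siglip(images)], dim=-1)
        vision_embeds = self.projector(vision_feats)
        # Prepend projected visual tokens to the text embeddings and let the
        # Mamba backbone (assumed here to accept precomputed embeddings)
        # model the joint sequence.
        return self.mamba(inputs_embeds=torch.cat([vision_embeds, text_embeds], dim=1))
```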

Inference Speed

Cobra runs 3×–4× faster than MobileVLM v2 3B and TinyLLaVA 3B on a single NVIDIA A100 80GB GPU.
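The comparison boils down to generation throughput, i.e., tokens generated per second. Below is a minimal measurement sketch; `model.generate` and its arguments are placeholders for whichever decoding API each compared model exposes, not the exact benchmarking script behind the reported numbers.

```python
import time

import torch


@torch.inference_mode()
def tokens_per_second(model, image, prompt_ids, max_new_tokens=256):
    """Rough single-GPU generation throughput in tokens per second."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    # `generate` is a placeholder for the model's autoregressive decoding API
    # and is assumed to return the full sequence of token ids.
    out = model.generate(image=image, input_ids=prompt_ids,
                         max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - prompt_ids.shape[-1]
    return new_tokens / elapsed
```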

Comparison with Baselines

Cobra achieves comparable performance to LLaVA v1.5 7B with about 43% of the parameters.

Ablation

We conduct several ablation studies on the projector (MLP vs. Lightweight Downsample Projector), the vision backbone (DINOv2 + SigLIP vs. SigLIP only), and the LLM backbone (base model vs. instruction-tuned chat model).
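For concreteness, here is a hedged sketch of the two projector variants compared in the ablation. The downsampling variant below simply average-pools the visual patch grid 2×2 before the same MLP, which is one plausible reading of a "lightweight downsample projector", not necessarily the exact design evaluated in the paper.

```python
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP projector (the default configuration)."""
    def __init__(self, vision_dim, lm_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):  # x: (B, N, vision_dim)
        return self.net(x)


class DownsampleProjector(nn.Module):
    """Downsampling variant: 2x2 average pooling over the patch grid
    cuts the number of visual tokens roughly 4x before the same MLP."""
    def __init__(self, vision_dim, lm_dim):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.mlp = MLPProjector(vision_dim, lm_dim)

    def forward(self, x):  # x: (B, N, vision_dim), N assumed to be a square g*g
        b, n, d = x.shape
        g = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, d, g, g)   # (B, D, g, g)
        x = self.pool(x)                            # (B, D, g//2, g//2)
        return self.mlp(x.flatten(2).transpose(1, 2))
```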

Cases

BibTeX

@article{zhao2024cobra,
  title={Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference},
  author={Han Zhao and Min Zhang and Wei Zhao and Pengxiang Ding and Siteng Huang and Donglin Wang},
  year={2024},
  eprint={2403.14520},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}