1 Zhejiang University 2 Westlake University 3 DAMO Academy, Alibaba Group
Abstract
In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs rely on the well-known Transformer network, whose quadratic computational complexity makes it less efficient. In this paper, we introduce Cobra, a multi-modal large language model built upon a state space model, which has demonstrated significant potential for handling long sequences efficiently, with fast inference and linear scaling with respect to sequence length. Specifically, Cobra replaces Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning the visual and textual modalities and for integrating various pre-trained Mamba variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra runs 3× ∼ 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2, a speed-up that stems from its linear sequential modeling. (ii) Cobra fine-tunes only a fraction of the parameters (∼ 48% of model parameters), yet achieves a significant improvement in overall performance compared to LLaVA.
We fuse DINOv2 and SigLIP as our vision backbone. The LLM backbone is a Mamba language model with 2.8B/7B parameters. The projector is a simple learnable MLP that aligns the vision and text features. During training, the parameters of the vision encoders are frozen, and we fine-tune the parameters of the projector and the Mamba LLM backbone.
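For concreteness, the following is a minimal PyTorch sketch of this setup, not the released implementation: the timm checkpoint names, the input resolutions, the fusion of the two encoders' patch tokens on a shared 16×16 grid, and the use of the Hugging Face MambaForCausalLM interface are all assumptions made for illustration.

```python
# Architectural sketch of the setup described above (frozen fused vision backbone,
# learnable MLP projector, trainable Mamba LLM); details are illustrative assumptions.
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import MambaForCausalLM

class CobraStyleVLM(nn.Module):
    def __init__(self, lm_name="state-spaces/mamba-2.8b-hf"):
        super().__init__()
        # Fused vision backbone: DINOv2 and SigLIP ViTs, kept frozen during training.
        self.dino = timm.create_model("vit_large_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
        self.siglip = timm.create_model("vit_so400m_patch14_siglip_224", pretrained=True, num_classes=0)
        for p in list(self.dino.parameters()) + list(self.siglip.parameters()):
            p.requires_grad = False

        # Mamba LLM backbone and a simple learnable MLP projector; both are fine-tuned.
        self.lm = MambaForCausalLM.from_pretrained(lm_name)
        vis_dim = self.dino.num_features + self.siglip.num_features
        lm_dim = self.lm.config.hidden_size
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, images_dino, images_siglip, input_ids):
        with torch.no_grad():
            d = self.dino.forward_features(images_dino)[:, 1:]  # drop CLS token -> (B, 1369, 1024) at 518 px
            s = self.siglip.forward_features(images_siglip)     # (B, 256, 1152) at 224 px
        # Resample DINOv2 patch tokens onto SigLIP's 16x16 grid, then fuse channel-wise.
        b, n, c = d.shape
        g = int(n ** 0.5)
        d = d.transpose(1, 2).reshape(b, c, g, g)
        d = F.adaptive_avg_pool2d(d, (16, 16)).flatten(2).transpose(1, 2)
        vis_embeds = self.projector(torch.cat([d, s], dim=-1))  # (B, 256, lm_dim)

        # Prepend the projected visual tokens to the text token embeddings and run the Mamba LM.
        txt_embeds = self.lm.get_input_embeddings()(input_ids)
        embeds = torch.cat([vis_embeds, txt_embeds], dim=1)
        return self.lm(inputs_embeds=embeds).logits
```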
Cobra-3.5B runs 3× ∼ 4× faster than MobileVLM v2 3B and LLaVA-Phi 3B on a single NVIDIA A100 80GB GPU.
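As a reference for how such a comparison can be measured, the snippet below is a sketch of a tokens-per-second timing harness for a causal LM on a single GPU; the checkpoint name and prompt are placeholders, and this is not the benchmarking script behind the reported numbers.

```python
# Minimal sketch: wall-clock generation throughput (tokens/s) for a causal LM on one GPU.
import time
import torch
from transformers import AutoTokenizer, MambaForCausalLM

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (out.shape[1] - input_ids.shape[1]) / elapsed  # generated tokens / seconds

model = MambaForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
print(f"{tokens_per_second(model, tokenizer, 'Describe the scene in detail.'):.1f} tokens/s")
```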
Cobra-3.5B achieves performance comparable to LLaVA v1.5 7B with about 43% of its parameters, and Cobra-8B surpasses LLaVA v1.5 7B on all benchmarks.
We conduct several ablation studies on the vision backbone (DINOv2 + SigLIP vs. SigLIP only), the projector (MLP vs. lightweight downsample projector), the LLM backbone (base model vs. instruction-tuned chat model), and the training strategy (pre-training vs. direct fine-tuning for different numbers of epochs).
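To make the projector ablation concrete, the sketch below contrasts a plain MLP projector with a lightweight downsample projector that reduces the number of visual tokens before alignment; the exact layer layout and pooling factor are illustrative assumptions rather than the paper's definitions.

```python
# Sketch of the two projector variants compared in the ablations (layouts are assumptions).
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vis_dim, lm_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, x):              # x: (B, N, vis_dim)
        return self.net(x)             # (B, N, lm_dim), token count unchanged

class DownsampleProjector(nn.Module):
    """Reduces the number of visual tokens (here 2x2 average pooling on the token grid)
    before the MLP, shortening the sequence fed to the LLM backbone."""
    def __init__(self, vis_dim, lm_dim):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.net = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, x):              # x: (B, N, vis_dim), N must be a perfect square
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.pool(x)                          # (B, vis_dim, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)          # (B, N/4, vis_dim)
        return self.net(x)

# Usage: with 256 fused visual tokens, the downsample variant hands the LLM 4x fewer tokens.
tokens = DownsampleProjector(vis_dim=2176, lm_dim=2560)(torch.randn(1, 256, 2176))  # -> (1, 64, 2560)
```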