

Poster

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

Yingyue Li · Bencheng Liao · Wenyu Liu · Xinggang Wang


Abstract:

With the advancement of RNN models with linear complexity, the quadratic-complexity bottleneck of transformers can potentially be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, leading to slow convergence, high resource demands, and suboptimal performance on downstream understanding and complex reasoning tasks. In this work, we introduce MaTVLM, a hybrid model that replaces a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. By leveraging the inherent relationship between attention and Mamba-2, we initialize the Mamba-2 layers with the corresponding attention weights to accelerate convergence. We further improve training efficiency through a single-stage distillation process, using the pre-trained VLM as a teacher model to transfer knowledge to MaTVLM. Additionally, we explore the impact of different distillation losses within our training framework. Evaluations across multiple benchmarks show that MaTVLM achieves performance competitive with the teacher model and existing VLMs while outperforming both Mamba-based VLMs and models of similar parameter scale. Remarkably, MaTVLM attains up to 3.6× faster inference than the teacher model and reduces GPU memory consumption by 27.5%, all without compromising performance.
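
The following is a minimal PyTorch sketch, not the authors' released code, illustrating the two ideas the abstract describes: replacing a portion of a pre-trained decoder's transformer layers with Mamba-2-style layers initialized from the corresponding attention weights, and distilling from the pre-trained teacher's outputs. All names here (SimpleAttentionLayer, Mamba2StyleLayer, HybridDecoder, distillation_loss) and the gated-linear stand-in for the Mamba-2 state-space scan are illustrative assumptions, not MaTVLM APIs.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleAttentionLayer(nn.Module):
    """Stand-in for a pre-trained transformer decoder layer (self-attention + MLP)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class Mamba2StyleLayer(nn.Module):
    """Placeholder for a Mamba-2 (linear-time SSM) layer. A real implementation
    would use a selective state-space scan; a gated linear mixer stands in here
    so the sketch stays self-contained."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def init_from_attention(self, attn_layer: SimpleAttentionLayer):
        # The abstract initializes Mamba-2 from the corresponding attention
        # weights to accelerate convergence; copying the attention output
        # projection is a rough analogue of that idea, not the paper's mapping.
        with torch.no_grad():
            self.out_proj.weight.copy_(attn_layer.attn.out_proj.weight)
            self.out_proj.bias.copy_(attn_layer.attn.out_proj.bias)

    def forward(self, x):
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))


class HybridDecoder(nn.Module):
    """Replace every `stride`-th layer of a pre-trained decoder with a
    Mamba-2-style layer, keeping the remaining attention layers intact."""
    def __init__(self, pretrained_layers: nn.ModuleList, stride: int = 4):
        super().__init__()
        layers = []
        for i, layer in enumerate(pretrained_layers):
            if i % stride == 0:
                m = Mamba2StyleLayer(layer.norm1.normalized_shape[0])
                m.init_from_attention(layer)
                layers.append(m)
            else:
                layers.append(layer)
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student outputs,
    a common logit-distillation objective (the paper studies several loss terms)."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T


if __name__ == "__main__":
    d_model, n_layers = 64, 8
    teacher_layers = nn.ModuleList([SimpleAttentionLayer(d_model) for _ in range(n_layers)])
    student = HybridDecoder(teacher_layers, stride=4)   # 2 of 8 layers become Mamba-2-style
    x = torch.randn(2, 16, d_model)                     # (batch, sequence, hidden)
    print(student(x).shape)                             # torch.Size([2, 16, 64])

In this toy configuration a quarter of the layers are replaced; the actual replacement ratio, the attention-to-Mamba-2 weight mapping, and the distillation loss combination are design choices studied in the paper.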
