Poster
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Jun Zhang · Desen Meng · Zhengming Zhang · Zhenpeng Huang · Tao Wu · Limin Wang
Abstract:
Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, their substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the $\textbf{Mixture-of-Depths}$ (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization ($\textbf{TanhNorm}$) and symmetric token reweighting ($\textbf{STRing}$). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers, and thus design a progressive ratio decay ($\textbf{PRD}$) strategy, which gradually reduces the token retention ratio layer by layer following a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. Extensive experiments on two baseline models across 15 benchmarks show that our model matches or even surpasses the performance of the corresponding baselines, while requiring only 55.6\% of the TFLOPs and 53.7\% of the KV cache storage during inference, and 77.7\% of the GPU hours during training.
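As an illustration of the PRD idea, the sketch below computes a per-layer token retention ratio from a shifted cosine schedule: shallow layers keep (nearly) all vision tokens, and the ratio decays toward zero in deeper layers. The exact functional form, the `shift` parameter, and the function name are assumptions for illustration, not the paper's official implementation.

```python
import math

def prd_retention_ratio(layer_idx: int, num_layers: int, shift: float = 0.0) -> float:
    """Hypothetical shifted cosine schedule for progressive ratio decay (PRD).

    Returns the fraction of vision tokens retained at a given layer.
    The schedule starts at 1.0 (keep all tokens) and decays following
    a cosine curve; `shift` delays the onset of decay in early layers.
    This is a sketch of the general idea, not the paper's exact formula.
    """
    # Normalized depth in [0, 1], optionally shifted, then clamped.
    t = layer_idx / (num_layers - 1) - shift
    t = min(max(t, 0.0), 1.0)
    # Cosine decay from 1.0 down to 0.0 over the (shifted) depth.
    return 0.5 * (1.0 + math.cos(math.pi * t))

# Example: retention ratios for a 28-layer LLM.
ratios = [prd_retention_ratio(i, 28) for i in range(28)]
```

With `shift = 0.0`, the first layer retains all tokens and the ratio decreases monotonically with depth, so most of the compute savings come from the deeper layers where, per the abstract, vision tokens are most redundant.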