Poster
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate
Qidong Huang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jiaqi Wang · Weiming Zhang · Nenghai Yu
Abstract:
Multi-modal pre-training plays a pivotal role in aligning the two modalities of Large Vision-Language Models (LVLMs), yet evaluating its quality usually requires the costly supervised fine-tuning (SFT) stage to verify downstream benchmark scores. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when quantifying pre-trained LVLMs. This lack of proper metrics greatly hinders research on the critical pre-training stage of LVLMs, including training data selection, efficient module design, etc.

In this paper, we first present Modality Integration Rate ($\textbf{MIR}$), an effective, robust, and generalized metric that indicates the multi-modal pre-training quality of LVLMs without SFT. This metric evaluates LVLM pre-training from the perspective of inter-modal distribution distance and is 1) $\textbf{Effective}$, representing pre-training quality and showing a positive correlation with benchmark performance after SFT; 2) $\textbf{Robust}$ toward different training/evaluation data; and 3) $\textbf{Generalized}$ across training configurations and architecture choices.

Complementing MIR, we further propose learnable Modality Calibration ($\textbf{MoCa}$), a lightweight module that narrows the modality gap at each language model layer during training. A series of experiments explores the effectiveness of MIR and MoCa, demonstrating that MIR is highly indicative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results. We hope MIR can serve as a helpful evaluator for building capable LVLMs and inspire subsequent research on modality alignment in different areas.
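To make the two ideas concrete, below is a minimal Python sketch of (a) a layer-wise inter-modal distribution distance in the spirit of MIR and (b) a per-layer learnable calibration in the spirit of MoCa. All names here (`frechet_distance`, `modality_distance`, `MoCaLayer`) are hypothetical, and the code is an illustrative proxy under assumed conventions, not the paper's implementation: the actual MIR formula and MoCa parameterization may include additional normalization and design details specified in the paper.

```python
# Illustrative sketch only: measures how far apart vision- and text-token
# feature distributions are at each LLM layer, then averages over layers.
# The paper's MIR may differ in details (e.g., normalization, token filtering).
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(x, y):
    """Fréchet distance between Gaussian fits of two feature sets.

    x, y: (num_tokens, hidden_dim) arrays of per-token features
    from one layer.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(((mu_x - mu_y) ** 2).sum()
                 + np.trace(cov_x + cov_y - 2.0 * covmean))


def modality_distance(vision_feats, text_feats):
    """Average per-layer distance between vision- and text-token features.

    vision_feats, text_feats: lists with one (num_tokens, hidden_dim)
    array per LLM layer, e.g. collected via forward hooks. A lower value
    suggests better-integrated modalities.
    """
    per_layer = [frechet_distance(v, t) for v, t in zip(vision_feats, text_feats)]
    return float(np.mean(per_layer))
```

For MoCa, one simple realization of a "lightweight module that narrows the modality gap at each layer" is a learnable elementwise scale and shift applied only to vision-token hidden states; the sketch below shows one such layer, again as an assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class MoCaLayer(nn.Module):
    """Hypothetical per-layer calibration: learnable elementwise scale and
    shift applied to vision-token hidden states, leaving text tokens intact."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_dim))
        self.shift = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, hidden_states, vision_mask):
        # hidden_states: (batch, seq_len, hidden_dim)
        # vision_mask:   (batch, seq_len) bool, True at vision-token positions
        calibrated = hidden_states * self.scale + self.shift
        return torch.where(vision_mask.unsqueeze(-1), calibrated, hidden_states)
```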