ICCV Poster MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding · Wu Shenxi · Xiangyu Zhao · Yuhang Zang · Haodong Duan · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Dahua Lin · Jiaqi Wang

Exhibit Hall I #95


Tue 21 Oct 2:45 p.m. PDT — 4:45 p.m. PDT

Abstract: Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and carry it out correctly. Existing multimodal instruction-following training data is scarce, the benchmarks are simple and use only atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data, MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and is extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both textual constraints for output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating rule-based assessment and LLM-as-a-Judge evaluation. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%).
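
To make the idea of rule-based assessment for exact output constraints concrete, here is a minimal Python sketch. It is not the MM-IFEval implementation: the checker names (`max_words`, `contains_keyword`, `bullet_count`), the constraints themselves, and the scoring rule (fraction of constraints satisfied) are all assumptions chosen for illustration. Constraints that cannot be verified programmatically, such as tone or visual grounding, would instead go to an LLM judge, which is not shown here.

```python
import re
from typing import Callable, Dict

# Hypothetical constraint checkers; names and rules are illustrative only
# and are not taken from the MM-IFEval codebase.
Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Pass if the response has at most `limit` whitespace-separated tokens."""
    return lambda response: len(response.split()) <= limit


def contains_keyword(keyword: str) -> Check:
    """Pass if the response mentions `keyword`, ignoring case."""
    return lambda response: keyword.lower() in response.lower()


def bullet_count(n: int) -> Check:
    """Pass if the response has exactly `n` lines starting with '-' or '*'."""
    pattern = re.compile(r"^\s*[-*]\s+", re.MULTILINE)
    return lambda response: len(pattern.findall(response)) == n


def score_response(response: str, checks: Dict[str, Check]) -> float:
    """Return the fraction of constraints the response satisfies."""
    results = [check(response) for check in checks.values()]
    return sum(results) / len(results) if results else 0.0


# Example: an instruction asking for two bullet points, under 20 words,
# that mention the dog visible in the image.
checks = {
    "length": max_words(20),
    "keyword": contains_keyword("dog"),
    "format": bullet_count(2),
}
response = "- A dog runs on the beach\n- The sky is clear"
score = score_response(response, checks)
```

A deterministic checker of this kind gives the same verdict on every run, which is why it suits constraints with an exact, machine-checkable definition; judge-based scoring is reserved for the rest.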
