

Poster

Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue

Guohao Sun · Can Qin · Yihao Feng · Zeyuan Chen · Ran Xu · Sohail Dianat · MAJID RABBANI · Raghuveer Rao · Zhiqiang Tao


Abstract:

Preference optimization algorithms typically enhance LLM response quality by leveraging human feedback on multiple answers to a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) -- a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating questioning and answering as a sequential action and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue settings and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks covering image-, multi-image-, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning LVLMs with SPO on multi-modal preference data aligns them with human preferences more efficiently than DPO.
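To make the trajectory-reward idea concrete, the following is a minimal sketch of what a DPO-style objective over question-answer trajectories could look like; the abstract does not give the exact formulation, so the symbols (policy pi_theta, reference policy pi_ref, temperature beta, visual context v, and the factorization of the trajectory likelihood into a question turn and an answer turn) are assumptions introduced here for illustration only.

\[
\mathcal{L}_{\mathrm{SPO}}(\theta) = -\,\mathbb{E}_{(v,\,\tau^{+},\,\tau^{-})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_{\theta}(\tau^{+}\mid v)}{\pi_{\mathrm{ref}}(\tau^{+}\mid v)} - \beta \log \frac{\pi_{\theta}(\tau^{-}\mid v)}{\pi_{\mathrm{ref}}(\tau^{-}\mid v)}\right)\right],
\qquad
\log \pi_{\theta}(\tau \mid v) = \log \pi_{\theta}(q \mid v) + \log \pi_{\theta}(a \mid v, q),
\]

where \(v\) is the visual context, \(\tau = (q, a)\) is a self-referenced question-answer trajectory, and \(\tau^{+}, \tau^{-}\) are the preferred and dispreferred trajectories ranked by the trajectory reward. Under this reading, the preference signal scores the question and answer jointly rather than a single response to a fixed instruction.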
