

Poster

Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving

Zixian Guo · Ming Liu · Qilong Wang · Zhilong Ji · Jinfeng Bai · Lei Zhang · Wangmeng Zuo


Abstract:

When addressing geometric problems, existing large vision-language models (LVLMs) demonstrate reasoning capabilities significantly inferior to those of their corresponding large language model (LLM) backbones. We attribute this issue to inadequate alignment and joint comprehension of visual and linguistic features. Moreover, the imprecise information that LVLMs extract from images further impairs their reasoning. To address this, we propose a dual-mind architecture that captures detailed visual information from images and facilitates effective linguistic reasoning through joint optimization. Unlike the existing supervised fine-tuning pipeline, in which LVLMs solve problems directly, we let the LVLM interpret the visual content first, extracting key elements such as precise geometric primitives and spatial relationships as natural-language conditions. An LLM then serves as a linguistic reasoner, deriving the answer through step-by-step reasoning. The visual interpretation module and the linguistic reasoning module collaborate effectively via an outcome-rewarded joint tuning strategy. By solving multimodal questions with the dual minds of an LVLM and an LLM, we achieve significant improvements on visually intensive geometric math problems. This work advances multimodal reasoning through a new coupled architecture with explicit visual perception and linguistic reasoning, which overcomes the limitations of current LVLMs. The code will be made publicly available.
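The abstract describes a two-stage pipeline: an LVLM first translates the figure into a natural-language description of geometric primitives and relationships, and an LLM then reasons over that text. The sketch below illustrates one plausible way to structure this composition; the prompts, the callable interfaces `lvlm_generate` and `llm_generate`, and the exact-match reward are all assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the dual-mind pipeline, assuming generic text-generation
# backends for the LVLM (visual interpreter) and LLM (linguistic reasoner).

from typing import Callable

# Hypothetical prompts; the paper's actual prompting scheme is not specified.
INTERPRET_PROMPT = (
    "Describe the geometric figure precisely: list the primitives "
    "(points, segments, angles, circles) and their spatial relationships."
)
SOLVE_PROMPT = (
    "Using the figure description and the question, solve the problem "
    "step by step and state the final answer."
)


def solve_geometry_problem(
    image: bytes,
    question: str,
    lvlm_generate: Callable[[bytes, str], str],  # stage 1: visual interpreter
    llm_generate: Callable[[str], str],          # stage 2: linguistic reasoner
) -> str:
    # Stage 1: the LVLM interprets the image into a natural-language
    # description of geometric primitives and spatial relationships.
    figure_description = lvlm_generate(image, INTERPRET_PROMPT)

    # Stage 2: the LLM reasons step by step over text only, conditioned on
    # the extracted description rather than on raw pixels.
    reasoner_input = (
        f"{SOLVE_PROMPT}\n\n"
        f"Figure description:\n{figure_description}\n\n"
        f"Question:\n{question}"
    )
    return llm_generate(reasoner_input)


def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    # Sketch of an outcome-level reward: a scalar based only on final-answer
    # correctness, which could be used to jointly tune both modules.
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0
```

Because the reward depends only on the final answer, both the interpreter and the reasoner receive a shared training signal, which is consistent with the abstract's outcome-rewarded joint tuning strategy.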
