Skip to yearly menu bar Skip to main content


Poster

VisionMath: Vision-Form Mathematical Problem-Solving

Zongyang Ma · Yuxin Chen · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Shaojie Zhu · Chengxiang Zhuo · Bing Li · Ye Liu · Zang Li · Ying Shan · Weiming Hu


Abstract:

Mathematical problems in real-world scenarios are often presented in a purely vision-form, where textual problem statement and accompanying math figures, e.g., geometry figures and functional graphs, are integrated into a single image. This vision-form problem-solving task requires precise comprehension and reasoning on both textual and graphical elements in the images, posing significant challenge to current Multimodal Large Language Models (MLLMs), which process text and math figures in isolation. In this work, we propose VisionMath, the first exploration for vision-form mathematical problem-solving model, which employs a three-stage progressive multimodal reasoning alignment strategy to systematically enhance task-specific capabilities. Building upon a LLM proficient in unimodal mathematical reasoning, VisionMath first establishes foundational OCR capabilities through capturing rendered mathematical problem images. Subsequently, the model develops comprehensive understanding of figure structures and properties via learning from figure descriptions and mathematical educational videos. Finally, the model's reasoning capacity is activated using carefully constructed visual-form problem-solving datasets VisionMath-IT with chain-of-thought annotations. For comprehensive evaluation, we construct multilingual benchmarks covering diverse problem types, including geometry, algebra, function problems in both English and Chinese. Our model weights, data and code will be public available.

Live content is unavailable. Log in and register to view live content