Poster
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Tengjin Weng · Jingyi Wang · Wenhao Jiang · Zhong Ming
Abstract:
Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce the Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5 as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human level on number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly released upon the paper's acceptance.