Poster
VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
Qiucheng Wu · Handong Zhao · Michael Saxon · Trung Bui · William Yang Wang · Yang Zhang · Shiyu Chang
Multimodal large language models (MLLMs) are an exciting emerging class of language models (LMs) that merge classic LM capabilities with those of image processing systems. However, how these capabilities integrate is often not intuitive and warrants direct investigation. One understudied capability in MLLMs is visual spatial planning: the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. Given their successes across many other scenarios, it is unclear why MLLMs fall short on these tasks, which are generally considered easy for humans. To this end, we introduce VSP, a benchmark that 1) evaluates the general spatial planning capability of MLLMs, and 2) diagnoses this capability via finer-grained sub-tasks, including perception and reasoning, measuring model performance on each. Our evaluation confirms that both open-source and proprietary MLLMs fail to generate effective plans for even simple spatial planning tasks. Evaluations on the fine-grained analytical tasks further reveal fundamental deficiencies in the models' visual perception and bottlenecks in their reasoning abilities, explaining their poor performance on the general spatial planning tasks. Our work illuminates future directions for improving MLLMs' abilities in spatial planning.