Poster
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma · Haoyu Chen · Guofeng Zhang · Yu-Cheng Chou · Celso de Melo · Alan Yuille · Jieneng Chen
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects in 3D space. This ability allows models to develop a comprehensive understanding of a 3D scene, enabling their applicability to a broader range of tasks, such as autonomous navigation, robotics, and AR/VR. Despite the remarkable improvements achieved by large multi-modal models (LMMs) on a wide range of image and video understanding tasks, their ability to perform 3D spatial reasoning remains understudied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 3,000 annotated image-question-answer triplets spanning 12 question types. We balance the data distribution by collecting complementary images that yield opposite answers to the same question, and we adopt a novel FlipEval protocol for robust evaluation of 3D spatial reasoning capabilities. Moreover, to study the robustness of 3D spatial reasoning with respect to camera 3D viewpoint, 3DSRBench includes two subsets with 3D spatial reasoning questions on images of the same scenes captured from common and uncommon viewpoints. We benchmark a wide range of open-source and proprietary LMMs, revealing their limitations in different aspects of 3D awareness, i.e., height, orientation, location, and multi-object reasoning. 3DSRBench also allows us to study design choices for developing LMMs with strong 3D reasoning capabilities, such as vision encoders, connectors, and training recipes.
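The abstract does not spell out the FlipEval mechanism, so the following is only a hedged illustration of what a FlipEval-style consistency check might look like: a model is credited for a left/right-sensitive question only when it answers correctly on both an image and its horizontal mirror, so that directional guessing cannot inflate the score. The `QAPair` fields, the `ask_model` helper, and the pairing scheme are all assumptions for illustration, not the benchmark's actual API.

```python
# Hypothetical sketch of a FlipEval-style consistency metric.
# All names here (QAPair, ask_model, model.answer) are illustrative
# assumptions, not the 3DSRBench implementation.
from dataclasses import dataclass

@dataclass
class QAPair:
    image_path: str          # original image
    flipped_image_path: str  # horizontally mirrored copy of the same image
    question: str            # e.g., "Is the chair to the left of the table?"
    answer: str              # ground-truth answer on the original image
    flipped_answer: str      # ground-truth answer on the mirrored image

def ask_model(model, image_path: str, question: str) -> str:
    """Placeholder for querying an LMM; returns its short-form answer."""
    return model.answer(image_path, question)

def flip_eval_accuracy(model, pairs: list[QAPair]) -> float:
    """Count a pair as correct only if the model answers BOTH the original
    and the flipped image correctly; random left/right guessing scores low."""
    correct = 0
    for p in pairs:
        ok_orig = ask_model(model, p.image_path, p.question) == p.answer
        ok_flip = ask_model(model, p.flipped_image_path, p.question) == p.flipped_answer
        correct += ok_orig and ok_flip
    return correct / len(pairs)
```

Under this assumed scheme, scoring the original and mirrored views jointly also naturally balances the answer distribution, in the same spirit as the complementary images described above.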