

Poster

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Wenxuan Zhu · Bing Li · Cheng Zheng · Jinjie Mai · Jun Chen · Letian Jiang · Abdullah Hamdi · Sara Rojas Martinez · Chia-Wen Lin · Mohamed Elhoseiny · Bernard Ghanem


Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, no publicly available standardized benchmark exists to assess the abilities of MLLMs in understanding 4D objects. In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks requiring multi-view spatio-temporal understanding, setting it apart from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results of the 4D object captioning experiments indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.
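To make the multiple-choice QA evaluation described above concrete, here is a minimal sketch of how accuracy against ground-truth answers could be computed. The data schema, function names, and the trivial answer function are hypothetical illustrations, not part of 4D-Bench's released tooling or the paper's actual evaluation code.

```python
from typing import Callable, Dict, List

# Hypothetical schema: each QA item bundles multi-view video frames,
# a question, answer choices, and the ground-truth choice label.
QAItem = Dict[str, object]


def evaluate_4d_qa(items: List[QAItem],
                   answer_fn: Callable[[List[str], str, List[str]], str]) -> float:
    """Return multiple-choice accuracy of `answer_fn` over `items`.

    `answer_fn` stands in for an MLLM call that maps (frames, question,
    choices) to a predicted choice label; it is an assumed interface.
    """
    correct = 0
    for item in items:
        pred = answer_fn(item["frames"], item["question"], item["choices"])
        correct += int(pred == item["answer"])
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy example with two fabricated items and a baseline that always answers "A".
    toy_items = [
        {"frames": ["view0_t0.png", "view1_t0.png"],
         "question": "Which direction does the object rotate?",
         "choices": ["A", "B", "C", "D"], "answer": "A"},
        {"frames": ["view0_t0.png", "view0_t1.png"],
         "question": "What action does the object perform?",
         "choices": ["A", "B", "C", "D"], "answer": "C"},
    ]
    print(f"accuracy = {evaluate_4d_qa(toy_items, lambda f, q, c: 'A'):.2f}")
```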
