
Poster

Object-centric Video Question Answering with Visual Grounding and Referring

Haochen Wang · Qirui Chen · Cilin Yan · Jiayin Cai · Xiaolong Jiang · Yao Hu · Weidi Xie · Stratis Gavves


Abstract:

Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed RGA3, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.
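The abstract does not specify how STOM is implemented; the minimal Python sketch below only illustrates the general idea of attaching a visual prompt to a specific timestamp, here by rendering a user-drawn bounding box onto the frame nearest that timestamp before the clip is passed to a video encoder. The function name, the box-drawing approach, and all parameters (fps, thickness, color) are illustrative assumptions, not the authors' method.

# Illustrative sketch only: the paper does not describe STOM's internals.
# Assumed behaviour: a visual prompt (a bounding box) given at a timestamp is
# overlaid on the corresponding frame of the sampled clip.
import numpy as np

def overlay_box_prompt(frames: np.ndarray, timestamp_s: float, fps: float,
                       box_xyxy: tuple, color=(255, 0, 0), thickness: int = 2) -> np.ndarray:
    """Draw a box prompt onto the frame nearest to `timestamp_s`.

    frames: (T, H, W, 3) uint8 video clip.
    box_xyxy: (x1, y1, x2, y2) pixel coordinates of the user's visual prompt.
    """
    # Map the timestamp to a frame index and clamp it to the clip length.
    t = int(round(timestamp_s * fps))
    t = max(0, min(t, frames.shape[0] - 1))
    x1, y1, x2, y2 = box_xyxy
    frame = frames[t].copy()
    frame[y1:y1 + thickness, x1:x2] = color   # top edge
    frame[y2 - thickness:y2, x1:x2] = color   # bottom edge
    frame[y1:y2, x1:x1 + thickness] = color   # left edge
    frame[y1:y2, x2 - thickness:x2] = color   # right edge
    out = frames.copy()
    out[t] = frame
    return out

# Example: overlay a box prompt at t = 2.0 s in a 4-second clip sampled at 8 fps.
clip = np.zeros((32, 224, 224, 3), dtype=np.uint8)
clip = overlay_box_prompt(clip, timestamp_s=2.0, fps=8.0, box_xyxy=(50, 60, 150, 160))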
