

Poster

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Zuhao Yang · Yingchen Yu · Yunqing Zhao · Shijian Lu · Song Bai


Abstract:

Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical, static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation are fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, the first Mixture-of-Experts (MoE)-enhanced Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, while improving computational efficiency. This design enables precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments show that TimeExpert consistently achieves state-of-the-art performance on fine-grained VTG tasks such as dense video captioning, moment retrieval, and video highlight detection. Our model and code will be publicly available.
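The core idea of routing task-specific tokens to specialized experts can be sketched as follows. This is a minimal illustrative PyTorch implementation, not the authors' code: the layer sizes, the number of experts, and the way the router is conditioned on a token-type embedding (e.g., text vs. timestamp vs. saliency token) are all assumptions made for the sketch.

```python
# Hypothetical sketch of token-type-aware MoE routing (not the TimeExpert code).
import torch
import torch.nn as nn

class TokenTypeMoE(nn.Module):
    """Routes each token through a soft mixture of expert FFNs, with the
    router conditioned on both the hidden state and a token-type embedding
    (e.g., 0 = text, 1 = timestamp, 2 = saliency)."""

    def __init__(self, d_model=64, d_ff=128, n_experts=3, n_types=3):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x, token_types):
        # x: (batch, seq, d_model); token_types: (batch, seq) long tensor
        h = x + self.type_emb(token_types)             # type-conditioned routing input
        gates = torch.softmax(self.router(h), dim=-1)  # (batch, seq, n_experts)
        # Run every expert and mix their outputs by the gate weights.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=-2)

moe = TokenTypeMoE()
x = torch.randn(2, 5, 64)                  # 2 sequences of 5 tokens
types = torch.randint(0, 3, (2, 5))        # per-token task type
out = moe(x, types)
print(out.shape)                           # torch.Size([2, 5, 64])
```

In a production MoE layer the gating is typically sparse (top-k experts per token) so that only a few experts run per token; the dense mixture above is kept for readability.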
