Poster
DisTime: Distribution-based Time Tokenizer for Temporal Localization with Video Large Language Model
Yingsen Zeng · Zepeng Huang · Yujie Zhong · Chengjian Feng · Jie Hu · Lin Ma · Yang Liu
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime uses a learnable token to create a continuous embedding space for all time points and incorporates a Distribution-based Time Tokenizer that decodes timestamps into probability distributions. These distributions effectively resolve boundary ambiguities and are then converted into continuous time values. Additionally, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models to overcome the temporal-granularity limitations of existing datasets. This yields InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Captions by a factor of 55. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance on Video QA tasks.
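As a rough illustration of the decoding idea described above (not the authors' released code), the sketch below shows how a time-token hidden state could be mapped to a probability distribution over discretized normalized time and reduced to a continuous timestamp via a soft-argmax readout. The hidden size, bin count, and readout are assumptions made for this sketch only.

```python
import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    """Sketch: map the hidden state of a learnable time token to a probability
    distribution over N bins of normalized video time [0, 1], then take the
    expectation to recover a continuous timestamp."""
    def __init__(self, hidden_dim: int, num_bins: int = 100):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_bins)
        # Bin centers span normalized time [0, 1].
        self.register_buffer(
            "bin_centers", (torch.arange(num_bins) + 0.5) / num_bins
        )

    def forward(self, time_token_hidden: torch.Tensor) -> torch.Tensor:
        # time_token_hidden: (batch, hidden_dim) hidden state of the time token.
        probs = torch.softmax(self.proj(time_token_hidden), dim=-1)  # (batch, num_bins)
        # Soft-argmax: the expectation over bin centers gives a continuous value
        # in [0, 1]; the spread of the distribution reflects boundary ambiguity.
        return (probs * self.bin_centers).sum(dim=-1)

# Usage: scale the normalized prediction by the video duration (in seconds).
decoder = DistributionTimeDecoder(hidden_dim=4096, num_bins=100)
hidden = torch.randn(2, 4096)          # stand-in for Video-LLM time-token states
timestamp_sec = decoder(hidden) * 120  # e.g. a 120-second video
```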