

Poster

Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild

Peijun Bao · Chenqi Kong · Siyuan Yang · Zihao Shao · Xinghao Jiang · Boon Ng · Meng Er · Alex Kot


Abstract:

Temporal video grounding aims to localize the temporal moment described by a natural language query in an untrimmed video. A major challenge of this task is its heavy reliance on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. Although such automatically collected data is not perfectly accurate, it scales easily without requiring extensive manual effort. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to filter out unpaired data and make initial adjustments to the temporal boundaries. In the subsequent memory-consensus correction phase, a memory bank tracks the model's predictions and progressively corrects the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization across multiple downstream settings. The code, dataset, and pretrained models are available at https://anonymous.4open.science/r/Vid-Group.
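To make the two-phase idea concrete, below is a minimal, illustrative sketch of how semantics-guided refinement and memory-consensus correction could look. All function names, thresholds, and the specific consensus rule (median of recent predictions when their spread is small) are assumptions chosen for illustration; the abstract does not specify these details, and the paper's actual implementation may differ.

```python
# Illustrative sketch only: names, thresholds, and the consensus rule are
# assumptions, not the paper's actual implementation.
import numpy as np
from collections import defaultdict, deque


def semantics_guided_refinement(frame_embs, query_emb,
                                pair_thresh=0.25, frame_thresh=0.5):
    """Phase 1: use query-frame semantic similarity to (a) drop sentence-video
    pairs that do not match and (b) make an initial adjustment of the pseudo
    temporal boundary."""
    # Cosine similarity between the query embedding and every frame embedding.
    sims = frame_embs @ query_emb
    sims = sims / (np.linalg.norm(frame_embs, axis=1)
                   * np.linalg.norm(query_emb) + 1e-8)

    if sims.max() < pair_thresh:      # query matches no frame well
        return None                   # -> treat as an unpaired sample and drop it

    # Keep the frames whose similarity exceeds a fraction of the peak value
    # as the refined temporal boundary (start_frame, end_frame).
    above = np.where(sims >= frame_thresh * sims.max())[0]
    return int(above.min()), int(above.max())


class MemoryConsensusCorrector:
    """Phase 2: a memory bank of recent model predictions per sample; the pseudo
    boundary is corrected once the stored predictions reach consensus."""

    def __init__(self, size=5, max_spread=3):
        self.memory = defaultdict(lambda: deque(maxlen=size))
        self.max_spread = max_spread  # allowed disagreement, in frames

    def update(self, sample_id, pred_span, current_span):
        self.memory[sample_id].append(pred_span)
        preds = np.array(self.memory[sample_id])
        if len(preds) < self.memory[sample_id].maxlen:
            return current_span       # not enough history for a consensus yet
        # Consensus: predictions agree if their spread is small; then take the
        # median span as the corrected pseudo boundary.
        if (preds.max(axis=0) - preds.min(axis=0)).max() <= self.max_spread:
            return tuple(np.median(preds, axis=0).astype(int))
        return current_span
```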
