Poster
Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
Peijun Bao · Chenqi Kong · Siyuan Yang · Zihao Shao · Xinghao Jiang · Boon Ng · Meng Er · Alex Kot
Temporal video grounding aims to localize the temporal moment described by a natural language query in an untrimmed video. A major challenge of this task is its heavy reliance on labor-intensive annotations for training. Unlike existing works that train models directly on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. Although the resulting pseudo annotations are not perfectly accurate, they are easily scalable without requiring extensive manual effort. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement phase improves the pseudo labels by leveraging semantic similarity with video frames to filter out unpaired data and make initial adjustments to the temporal boundaries. In the subsequent memory-consensus correction phase, a memory bank tracks the model's predictions and progressively corrects the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. The code, dataset, and pretrained models are available at https://anonymous.4open.science/r/Vid-Group.
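To make the two phases concrete, below is a minimal, illustrative sketch in NumPy of how a semantics-guided refinement step and a memory-consensus correction step could look. It is not the authors' implementation: all names (refine_pseudo_label, MemoryConsensus), the thresholds, and the consensus rule are hypothetical assumptions for illustration only.

```python
import numpy as np


def refine_pseudo_label(frame_text_sim, start, end, keep_thresh=0.2):
    """Phase 1 (sketch): semantics-guided refinement.

    frame_text_sim: (num_frames,) cosine similarity between each frame
        embedding and the query sentence embedding (e.g., from a pretrained
        vision-language encoder).
    start, end: pseudo temporal boundary as frame indices.
    Returns None if the sentence-video pair looks mismatched, else an
    adjusted (start, end) boundary.
    """
    segment_sim = frame_text_sim[start:end + 1]
    # Drop pairs whose best frame-text similarity is too low (unpaired data).
    if segment_sim.max() < keep_thresh:
        return None
    # Initial boundary adjustment: keep the frames whose similarity stays
    # within a margin of the segment's peak similarity.
    peak = segment_sim.max()
    good = np.where(segment_sim >= 0.8 * peak)[0]
    return start + int(good.min()), start + int(good.max())


class MemoryConsensus:
    """Phase 2 (sketch): memory-consensus correction.

    Keeps a short history of the model's predicted boundaries per sample and
    replaces the pseudo boundary once the recent predictions agree.
    """

    def __init__(self, size=5, iou_thresh=0.7):
        self.size = size
        self.iou_thresh = iou_thresh
        self.bank = {}  # sample_id -> list of (start, end) predictions

    @staticmethod
    def _iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    def update(self, sample_id, pred, current_label):
        history = self.bank.setdefault(sample_id, [])
        history.append(pred)
        if len(history) > self.size:
            history.pop(0)
        # Consensus check: all stored predictions overlap strongly with
        # their mean boundary.
        mean_box = (float(np.mean([p[0] for p in history])),
                    float(np.mean([p[1] for p in history])))
        if len(history) == self.size and all(
                self._iou(p, mean_box) >= self.iou_thresh for p in history):
            return mean_box  # progressively correct the pseudo boundary
        return current_label
```

In this sketch, phase 1 runs once over the raw pseudo annotations before training, while phase 2's `update` would be called each time the model predicts a boundary for a sample during pretraining; the paper's actual refinement and consensus criteria may differ.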