Poster
Learning Beyond Still Frames: Scaling Vision-Language Models with Video
Yiyuan Zhang · Handong Li · Jing Liu · Xiangyu Yue
High-quality image-text data is critical for enhancing Vision-Language Models (VLMs), but traditional image-based pretraining approaches face limitations. These methods are resource-intensive, relying on curated, high-quality interleaved data that is costly and challenging to collect at scale. Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we incorporate video pretraining into VLMs to improve their ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. To this end, we propose Causal Hierarchical Aggregation, which separates computation-heavy spatial encoding from lightweight temporal propagation and constructs hierarchical receptive fields at varying granularities. As we scale the video context to more than 100B tokens, our method achieves high throughput and state-of-the-art performance on both image and video understanding, as shown in Figure 1, providing a scalable solution for enhancing multimodal learning in dynamic contexts.
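The abstract gives no implementation details, so the following is only a rough sketch of the stated idea: encode each frame with a heavy spatial encoder, then propagate information with a lightweight, strictly causal aggregation over windows of several temporal granularities. All class names, window sizes, and dimensions here are illustrative assumptions, not the authors' architecture.

```python
# Minimal, hypothetical sketch of the idea described above (not the authors'
# implementation): a heavy per-frame spatial encoder followed by a lightweight,
# strictly causal temporal aggregation over windows of several granularities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHierarchicalAggregation(nn.Module):
    """Causal temporal pooling over hierarchical window sizes (illustrative)."""

    def __init__(self, dim: int, window_sizes=(1, 4, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        # One lightweight projection per temporal granularity.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in window_sizes])
        self.fuse = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) features, one vector per frame.
        b, t, d = x.shape
        csum = torch.cumsum(x, dim=1)  # prefix sums give O(1) causal window means
        levels = []
        for w, proj in zip(self.window_sizes, self.proj):
            # Mean over the last `w` frames only; frames after position i are never seen.
            shifted = F.pad(csum, (0, 0, w, 0))[:, :t]
            window_sum = csum - shifted
            counts = torch.arange(1, t + 1, device=x.device).clamp(max=w)
            levels.append(proj(window_sum / counts.view(1, t, 1).to(x.dtype)))
        return self.fuse(torch.cat(levels, dim=-1))


class VideoBackboneSketch(nn.Module):
    """Separates computation-heavy spatial encoding from cheap temporal propagation."""

    def __init__(self, in_dim: int = 256, dim: int = 256):
        super().__init__()
        # Stand-in for an expensive image encoder applied to each frame independently.
        self.spatial = nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.temporal = CausalHierarchicalAggregation(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, in_dim), e.g. pooled patch embeddings per frame.
        per_frame = self.spatial(frames)   # heavy, but no cross-frame computation
        return self.temporal(per_frame)    # light, strictly causal mixing over time


if __name__ == "__main__":
    model = VideoBackboneSketch(in_dim=256, dim=256)
    out = model(torch.randn(2, 32, 256))   # 2 clips of 32 frames each
    print(out.shape)                        # torch.Size([2, 32, 256])
```

Because each frame only ever aggregates over its own past, a design like this preserves the strict temporal causality mentioned above while keeping the cost of the temporal stage negligible next to the per-frame spatial encoder.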