

Poster

VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges

Yuxuan Wang · Yiqi Song · Cihang Xie · Yang Liu · Zilong Zheng


Abstract: Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly surpasses existing video-language models, achieving a $4.2$-point improvement over its competitors across four VideoQA benchmarks and a $2.06$-point improvement on egocentric planning. Remarkably, like PLLaVA, it maintains robust performance even as video length increases up to $8\times$. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without requiring additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single NVIDIA A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness.
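To make the two core ideas in the abstract concrete, below is a minimal PyTorch-style sketch. It is not the authors' implementation: the class and function names (`MemoryBridge`, `scene_tilling`), the cross-attention memory update, and the cosine-similarity cut rule are illustrative assumptions standing in for the recurrent memory bridge layers and the SceneTilling segmentation described above.

```python
import torch
import torch.nn as nn


class MemoryBridge(nn.Module):
    """Hypothetical bridge layer: memory tokens attend to the current
    segment's visual tokens (plus the previous memory state) and the
    updated memory is carried forward to the next segment."""

    def __init__(self, dim=1024, num_memory_tokens=16, num_heads=8):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(num_memory_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_feats, prev_memory=None):
        # segment_feats: (B, N, dim) visual tokens of one semantic segment
        B = segment_feats.size(0)
        mem = (self.init_memory.unsqueeze(0).expand(B, -1, -1)
               if prev_memory is None else prev_memory)
        # Memory tokens query the concatenation of old memory and segment tokens.
        context = torch.cat([mem, segment_feats], dim=1)
        updated, _ = self.attn(mem, context, context)
        return self.norm(mem + updated)  # (B, num_memory_tokens, dim)


def scene_tilling(frame_feats, threshold=0.8):
    """Toy stand-in for SceneTilling: cut the video wherever the cosine
    similarity between consecutive frame features drops below a threshold,
    yielding independent semantic segments."""
    sims = nn.functional.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    cuts = (sims < threshold).nonzero(as_tuple=True)[0] + 1
    bounds = [0, *cuts.tolist(), frame_feats.size(0)]
    return [frame_feats[s:e] for s, e in zip(bounds[:-1], bounds[1:]) if e > s]


# Streaming usage (sketch): segment a long video, then carry memory
# across segments so early content stays accessible to the model.
if __name__ == "__main__":
    frame_feats = torch.randn(320, 1024)          # e.g. 320 frames of pooled features
    bridge = MemoryBridge(dim=1024)
    memory = None
    for segment in scene_tilling(frame_feats):
        memory = bridge(segment.unsqueeze(0), memory)
    print(memory.shape)  # (1, 16, 1024) compact state passed to the language model
```

Because only a fixed number of memory tokens is kept per segment, the per-segment cost stays constant and total GPU memory grows roughly linearly with the number of frames, which is consistent with the 16-frame training / 320-frame inference behavior reported in the abstract.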
