Streaming VideoLLMs for Real-Time Procedural Video Understanding
Dibyadip Chatterjee ⋅ Edoardo Remelli ⋅ Yale Song ⋅ Bugra Tekin ⋅ Abhay Mittal ⋅ Bharat Bhatnagar ⋅ Necati Cihan Camgoz ⋅ Shreyas Hampali ⋅ Eric Sauser ⋅ Shugao Ma ⋅ Angela Yao ⋅ Fadime Sener
Abstract
We introduce ProVideLLM, an end-to-end framework for real-time streaming procedural assistance. ProVideLLM maintains a multimodal cache that stores two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details of short-term observations. This design reduces the token count for representing one hour of long-term observations by $22\times$ compared to existing methods, while preserving fine-grained representations. By interleaving these tokens in the multimodal cache, ProVideLLM scales memory and compute sub-linearly with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a GPU memory footprint of only 2GB. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
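The interleaved-cache idea in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the class name, window size, and summarization trigger are hypothetical. Long-term history is kept only as a few verbalized text tokens, while fine-grained visual tokens are retained solely for a short recent window, so total cache size stays bounded as the video grows.

```python
from collections import deque

class MultimodalCache:
    """Illustrative sketch (not the paper's code): an interleaved cache of
    compact text summaries for long-term history and fine-grained visual
    tokens for the short-term window."""

    def __init__(self, max_visual_steps=4):
        # Verbalized summaries: few tokens per long-term segment.
        self.text_tokens = []
        # Visual tokens are kept only for the most recent steps;
        # older entries are evicted automatically by the deque.
        self.visual_tokens = deque(maxlen=max_visual_steps)

    def step(self, frame_tokens, summary=None):
        # Each frame contributes a small set of visual tokens.
        self.visual_tokens.append(frame_tokens)
        # Periodically, long-term observations are compressed into a
        # short verbalized summary, so memory grows sub-linearly
        # with video length.
        if summary is not None:
            self.text_tokens.append(summary)

    def interleaved(self):
        # Context fed to the LLM: long-term text summaries followed by
        # the recent fine-grained visual tokens.
        return self.text_tokens + [
            tok for step in self.visual_tokens for tok in step
        ]
```

The key property is that `interleaved()` returns a bounded-plus-slowly-growing token sequence rather than one that grows linearly with the number of frames.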