Poster
Streaming VideoLLMs for Real-Time Procedural Video Understanding
Dibyadip Chatterjee · Edoardo Remelli · Yale Song · Bugra Tekin · Abhay Mittal · Bharat Bhatnagar · Necati Cihan Camgoz · Shreyas Hampali · Eric Sauser · Shugao Ma · Angela Yao · Fadime Sener
Abstract:
We introduce ProVideLLM, an end-to-end framework for real-time streaming procedural assistance. ProVideLLM integrates a multimodal cache that stores two types of tokens -- verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count by $22\times$ over existing methods for representing one hour of long-term observations, while still encoding fine-grained short-term representations. By interleaving these tokens in the multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2 GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
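The central mechanism here is the interleaved multimodal cache: a bounded store of compact text summaries for the long-term past plus fine-grained visual tokens for the recent window, concatenated into the LLM's input at each streaming step. Below is a minimal sketch of that idea; the class and method names, token dimensions, and capacity limits are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an interleaved multimodal cache (illustrative; not the
# authors' code). Bounded capacities keep memory sub-linear in video
# length, since old entries are evicted as new observations stream in.
from collections import deque

import torch


class MultimodalCache:
    """Holds two token streams: compressed verbalized-text tokens
    summarizing long-term observations, and fine-grained visual tokens
    (e.g. from a DETR-QFormer-style encoder) for the short-term window."""

    def __init__(self, max_text_entries: int = 256, max_visual_frames: int = 8):
        # Capacities are assumed values for illustration.
        self.text_tokens = deque(maxlen=max_text_entries)    # long-term summaries
        self.visual_tokens = deque(maxlen=max_visual_frames) # short-term frames

    def add_summary(self, tokens: torch.Tensor) -> None:
        """Append verbalized text tokens, shape (n_text, d)."""
        self.text_tokens.append(tokens)

    def add_frame(self, tokens: torch.Tensor) -> None:
        """Append visual tokens for the newest frame, shape (n_vis, d)."""
        self.visual_tokens.append(tokens)

    def as_llm_input(self) -> torch.Tensor:
        """Interleave long-term text tokens ahead of short-term visual
        tokens to form the prefix consumed by the LLM each step."""
        chunks = list(self.text_tokens) + list(self.visual_tokens)
        return torch.cat(chunks, dim=0)


# Illustrative usage with random features (embedding size d = 1024 assumed):
cache = MultimodalCache()
cache.add_summary(torch.randn(4, 1024))   # a few compressed text tokens
cache.add_frame(torch.randn(32, 1024))    # fine-grained visual tokens
prefix = cache.as_llm_input()             # shape (36, 1024)
```

Under these assumptions, the sub-linear scaling follows from the eviction policy: the cache length is bounded regardless of how long the video runs, so per-step compute depends on the cache size rather than the full observation history.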