Streaming VideoLLMs for Real-Time Procedural Video Understanding
Dibyadip Chatterjee ⋅ Edoardo Remelli ⋅ Yale Song ⋅ Bugra Tekin ⋅ Abhay Mittal ⋅ Bharat Bhatnagar ⋅ Necati Cihan Camgoz ⋅ Shreyas Hampali ⋅ Eric Sauser ⋅ Shugao Ma ⋅ Angela Yao ⋅ Fadime Sener
Abstract
We introduce ProVideLLM, an end-to-end framework for real-time streaming procedural assistance. ProVideLLM maintains a multimodal cache that stores two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details of short-term observations. This design reduces the token count for representing one hour of long-term observations by $22\times$ compared to existing methods, while preserving fine-grained representations. By interleaving these tokens in the multimodal cache, ProVideLLM scales memory and compute sub-linearly with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a GPU memory footprint of only 2GB. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
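The interleaved-cache idea in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the class name, window size, and summarization trigger are hypothetical. Long-term history is kept only as a few verbalized text tokens, while fine-grained visual tokens are retained solely for a short recent window, so total cache size stays bounded as the video grows.

```python
from collections import deque

class MultimodalCache:
    """Illustrative sketch (not the paper's code): an interleaved cache of
    compact text summaries for long-term history and fine-grained visual
    tokens for the short-term window."""

    def __init__(self, max_visual_steps=4):
        # Verbalized summaries: few tokens per long-term segment.
        self.text_tokens = []
        # Visual tokens are kept only for the most recent steps;
        # older entries are evicted automatically by the deque.
        self.visual_tokens = deque(maxlen=max_visual_steps)

    def step(self, frame_tokens, summary=None):
        # Each frame contributes a small set of visual tokens.
        self.visual_tokens.append(frame_tokens)
        # Periodically, long-term observations are compressed into a
        # short verbalized summary, so memory grows sub-linearly
        # with video length.
        if summary is not None:
            self.text_tokens.append(summary)

    def interleaved(self):
        # Context fed to the LLM: long-term text summaries followed by
        # the recent fine-grained visual tokens.
        return self.text_tokens + [
            tok for step in self.visual_tokens for tok in step
        ]
```

The key property is that `interleaved()` returns a bounded-plus-slowly-growing token sequence rather than one that grows linearly with the number of frames.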