Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Abstract
Video large language models (LLMs) achieve strong video understanding by processing a large number of tokens spanning the spatio-temporal space. However, the computational complexity grows quadratically with the number of tokens, which remains a critical bottleneck. To address this, we propose a novel training-free spatio-temporal token merging (STTM) method designed to improve token efficiency in video LLMs. Our key insight is to exploit the inherent spatial and temporal local redundancy in video data, which previous research has overlooked. Specifically, we transform each frame into multi-granular spatial tokens using a coarse-to-fine search algorithm based on the quadtree data structure, and then perform multi-granular directed pairwise merging along the temporal dimension. This decomposed merging approach significantly reduces the number of redundant visual tokens across the spatio-temporal dimensions. Experiments on multiple video QA benchmarks show that our approach outperforms existing token reduction methods in accuracy. Surprisingly, it maintains above 99\% relative accuracy to the full-token model while using only 50\% of the token budget. This token reduction also translates into lower inference latency.
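To make the two-stage decomposition concrete, below is a minimal Python sketch under stated assumptions: visual tokens arrive as a (frames, H, W, D) grid of embeddings, similarity is measured with cosine similarity, and merged tokens are formed by mean pooling. The function names, thresholds, and merging rules are illustrative placeholders, not the exact STTM procedure.

```python
# Illustrative sketch of decomposed spatio-temporal token merging.
# NOTE: the similarity test, mean-pool merge rule, and thresholds below
# are assumptions for exposition, not the paper's exact algorithm.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def quadtree_merge(grid, y, x, size, tau, out):
    """Coarse-to-fine spatial merging over one frame's (H, W, D) token grid.
    If every token in the current square is close to the square's mean,
    emit a single merged token; otherwise split into four quadrants."""
    patch = grid[y:y + size, x:x + size].reshape(-1, grid.shape[-1])
    mean = patch.mean(axis=0)
    if size == 1 or all(cosine(t, mean) >= tau for t in patch):
        out.append(((y, x, size), mean))  # one multi-granular spatial token
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            quadtree_merge(grid, y + dy, x + dx, half, tau, out)

def temporal_merge(per_frame_tokens, tau):
    """Directed pairwise merging along time: a token in frame t is absorbed
    by frame t-1 when a token with the same position and granularity there
    is sufficiently similar."""
    kept = [dict(per_frame_tokens[0])]  # frame 0 keeps all its tokens
    for t in range(1, len(per_frame_tokens)):
        prev, cur = dict(per_frame_tokens[t - 1]), {}
        for key, tok in per_frame_tokens[t]:
            match = prev.get(key)
            if match is None or cosine(tok, match) < tau:
                cur[key] = tok  # token survives temporal merging
        kept.append(cur)
    return kept

# Toy video: 8 frames of a 16x16 token grid (64-d embeddings) built from
# four flat 8x8 quadrants, so both kinds of local redundancy are present.
rng = np.random.default_rng(0)
quads = rng.normal(size=(2, 2, 64)).astype(np.float32)
base = np.repeat(np.repeat(quads, 8, axis=0), 8, axis=1)
video = base[None] + 0.01 * rng.normal(size=(8, 16, 16, 64)).astype(np.float32)

spatial = []
for frame in video:
    toks = []
    quadtree_merge(frame, 0, 0, 16, tau=0.9, out=toks)
    spatial.append(toks)

merged = temporal_merge(spatial, tau=0.9)
print(sum(len(f) for f in merged), "tokens kept out of", 8 * 16 * 16)
```

In this sketch, each token is compared only against its local quadrant mean and its same-position predecessor in the previous frame, so the merging pass stays cheap relative to the quadratic attention cost it amortizes.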