Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Abstract
Video large language models (LLMs) achieve strong video understanding by processing a large number of tokens spanning the spatio-temporal space. However, the computational complexity grows quadratically with the number of tokens, which remains a critical bottleneck. To address this, we propose a novel training-free spatio-temporal token merging (STTM) method designed to improve token efficiency in video LLMs. Our key insight is to exploit the inherent spatial and temporal local redundancy in video data, which previous research has overlooked. Specifically, we transform each frame into multi-granular spatial tokens using a coarse-to-fine search algorithm based on the quadtree data structure, and then perform multi-granular directed pairwise merging along the temporal dimension. This decomposed merging approach significantly reduces the number of redundant visual tokens across the spatio-temporal dimensions. Experiments on multiple video QA benchmarks show that our approach outperforms existing token reduction methods in accuracy. Surprisingly, it maintains above 99\% relative accuracy to the full-token model while using only 50\% of the token budget. This token reduction also translates into lower inference latency.
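To make the two-stage decomposition concrete, below is a minimal Python sketch under stated assumptions: visual tokens arrive as a (frames, H, W, D) grid of embeddings, similarity is measured with cosine similarity, and merged tokens are formed by mean pooling. The function names, thresholds, and merging rules are illustrative placeholders, not the exact STTM procedure.

```python
# Illustrative sketch of decomposed spatio-temporal token merging.
# NOTE: the similarity test, mean-pool merge rule, and thresholds below
# are assumptions for exposition, not the paper's exact algorithm.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def quadtree_merge(grid, y, x, size, tau, out):
    """Coarse-to-fine spatial merging over one frame's (H, W, D) token grid.
    If every token in the current square is close to the square's mean,
    emit a single merged token; otherwise split into four quadrants."""
    patch = grid[y:y + size, x:x + size].reshape(-1, grid.shape[-1])
    mean = patch.mean(axis=0)
    if size == 1 or all(cosine(t, mean) >= tau for t in patch):
        out.append(((y, x, size), mean))  # one multi-granular spatial token
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            quadtree_merge(grid, y + dy, x + dx, half, tau, out)

def temporal_merge(per_frame_tokens, tau):
    """Directed pairwise merging along time: a token in frame t is absorbed
    by frame t-1 when a token with the same position and granularity there
    is sufficiently similar."""
    kept = [dict(per_frame_tokens[0])]  # frame 0 keeps all its tokens
    for t in range(1, len(per_frame_tokens)):
        prev, cur = dict(per_frame_tokens[t - 1]), {}
        for key, tok in per_frame_tokens[t]:
            match = prev.get(key)
            if match is None or cosine(tok, match) < tau:
                cur[key] = tok  # token survives temporal merging
        kept.append(cur)
    return kept

# Toy video: 8 frames of a 16x16 token grid (64-d embeddings) built from
# four flat 8x8 quadrants, so both kinds of local redundancy are present.
rng = np.random.default_rng(0)
quads = rng.normal(size=(2, 2, 64)).astype(np.float32)
base = np.repeat(np.repeat(quads, 8, axis=0), 8, axis=1)
video = base[None] + 0.01 * rng.normal(size=(8, 16, 16, 64)).astype(np.float32)

spatial = []
for frame in video:
    toks = []
    quadtree_merge(frame, 0, 0, 16, tau=0.9, out=toks)
    spatial.append(toks)

merged = temporal_merge(spatial, tau=0.9)
print(sum(len(f) for f in merged), "tokens kept out of", 8 * 16 * 16)
```

In this sketch, each token is compared only against its local quadrant mean and its same-position predecessor in the previous frame, so the merging pass stays cheap relative to the quadratic attention cost it amortizes.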