

Poster

SparseVILA: Query-Aware Visual Sparsity Should Happen at Decoding

Samir Khaki · Junxian Guo · Jiaming Tang · Shang Yang · Yukang Chen · Konstantinos Plataniotis · Yao Lu · Song Han · Zhijian Liu


Abstract: Vision language models (VLMs) have garnered increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large visual inputs unlocks numerous downstream applications, it often introduces significant latency challenges, as visual tokens dominate resource consumption. In this work, we introduce SparseVILA, a novel query-aware token retrieval method that dynamically accelerates the underlying LLM by pruning tokens in the context stage and attending to only a sparse subset of visual tokens during the generation phase. By decoupling context and generation compression, we can migrate the majority of sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a 1.5$\times$ speedup on image benchmarks. Furthermore, this approach yields significant accuracy improvements on image-centric benchmarks over previous query-aware and query-agnostic pruning methods. Finally, SparseVILA enables efficient long-context and long-generation tasks, achieving speedups of 6.3$\times$ in context processing and 1.7$\times$ in generation, respectively.
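The core idea described above, retaining only the visual tokens most relevant to the text query when decoding, can be illustrated with a minimal sketch. The snippet below is an assumption about how query-aware selection over cached visual key/value states might look; it is not the released SparseVILA implementation, and the function name `select_visual_kv`, tensor shapes, and `keep_ratio` parameter are hypothetical.

```python
# Conceptual sketch (not the authors' code): query-aware visual KV selection
# at decoding time, assuming the visual tokens occupy a known slice of the
# Transformer KV cache. Shapes and names below are illustrative assumptions.

import torch


def select_visual_kv(
    visual_keys: torch.Tensor,    # [num_heads, num_visual_tokens, head_dim]
    visual_values: torch.Tensor,  # [num_heads, num_visual_tokens, head_dim]
    query_states: torch.Tensor,   # [num_heads, num_query_tokens, head_dim] from the text query
    keep_ratio: float = 0.1,
):
    """Keep only the visual KV entries most attended to by the text query."""
    head_dim = visual_keys.shape[-1]

    # Attention logits between the query tokens and every cached visual key.
    scores = torch.einsum("hqd,hkd->hqk", query_states, visual_keys) / head_dim**0.5

    # Aggregate importance per visual token across heads and query positions.
    importance = scores.softmax(dim=-1).sum(dim=(0, 1))  # [num_visual_tokens]

    num_keep = max(1, int(keep_ratio * importance.numel()))
    keep_idx = importance.topk(num_keep).indices.sort().values

    # Subsequent decoding steps attend only to this sparse visual subset.
    return visual_keys[:, keep_idx], visual_values[:, keep_idx]
```

In this sketch, the sparsity is applied to the KV cache rather than to the context-stage computation, which mirrors the abstract's point that most of the compression can be deferred to the generation stage while the full context remains available for later turns.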
