

Poster

Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

Yitong Jiang · Jinwei Gu · Tianfan Xue · Ka Chun Cheung · Pavlo Molchanov · Hongxu Yin · Sifei Liu


Abstract:

Vision-Language Models (VLMs) excel at visual understanding by leveraging pretrained image encoders to generate visual tokens. However, they struggle with high-resolution images and zoomed-in regions: uniform patch-based processing incurs heavy computation and token redundancy, often losing critical details. To address these challenges, we propose the Token-Efficient Vision-Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to capture fine-grained details efficiently while preserving global context. Our approach first identifies subject-oriented regions using an adaptive detection strategy. A dynamic patch sampling mechanism then selects and arranges patches at varying scales, ensuring efficient processing without increasing the token count. Extensive experiments demonstrate that TEVA significantly enhances VLM performance on visual details and integrates seamlessly with various decoders and LLMs. Code and dataset will be released upon publication.
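The abstract does not specify how TEVA splits its token budget between the global view and the detected regions, so the following is only a minimal, hypothetical sketch of the general idea: crop detected regions at full resolution, downsample the global view, and tile both into patches under a fixed budget. Function names, the half/half budget split, and the resizing scheme are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F


def to_patches(img, patch_size):
    """Split a (3, H, W) tensor into non-overlapping patch tokens of shape (N, 3*p*p)."""
    c, h, w = img.shape
    p = patch_size
    img = img[:, : h - h % p, : w - w % p]                      # drop ragged border
    patches = img.unfold(1, p, p).unfold(2, p, p)               # (3, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)


def dynamic_patch_sampling(image, regions, patch_size=14, token_budget=576):
    """
    Hypothetical sketch: give roughly half the token budget to a downsampled
    global view and split the rest evenly across full-resolution crops of the
    detected regions, resizing each source so its patch count fits its share.
    """
    global_budget = token_budget // 2 if regions else token_budget
    region_budget = (token_budget - global_budget) // max(len(regions), 1)

    def resize_to_budget(img, budget):
        # Pick a square grid whose patch count does not exceed the budget.
        side = int(budget ** 0.5) * patch_size
        return F.interpolate(img[None], size=(side, side),
                             mode="bilinear", align_corners=False)[0]

    tokens = [to_patches(resize_to_budget(image, global_budget), patch_size)]
    for (x0, y0, x1, y1) in regions:
        crop = image[:, y0:y1, x0:x1]                            # full-resolution region
        tokens.append(to_patches(resize_to_budget(crop, region_budget), patch_size))
    return torch.cat(tokens, dim=0)                              # (<= token_budget, 3*p*p)


if __name__ == "__main__":
    img = torch.rand(3, 1536, 2048)                              # high-resolution input
    boxes = [(100, 200, 600, 700), (900, 300, 1500, 900)]        # e.g. from a region detector
    print(dynamic_patch_sampling(img, boxes).shape)
```

In this sketch the total patch count never exceeds the budget of a standard uniform encoding, which is how a scheme of this kind could add fine-grained regional detail without increasing the number of visual tokens passed to the LLM.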
