Poster
Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens
Runpeng Yu · Xinyin Ma · Xinchao Wang
In MLLMs, visual perception refers to the process by which the model encodes visual inputs, such as images, and aligns them with the text embedding space. Currently, MLLMs still lack the capability to autonomously control their own visual perception processes; for example, they cannot selectively re-encode specific regions of an image or focus on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism for controlling their own visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs generate these tokens autonomously, just as they generate natural language tokens, and use them to trigger additional visual perception processes. The Region Selection Token explicitly identifies regions of interest that require further processing, while the Vision Re-Encoding Token uses its hidden states to guide an additional vision encoding pass. Extensive experiments highlight the effectiveness of these tokens in enhancing spatial reasoning, fine-grained understanding, Text/OCR-related VQA, and a wide range of other visual tasks. On average, introducing Visual Perception Tokens improves the performance of a 2B model by 30.9%, raising its score from 0.572 to 0.749, and even lets it outperform a 7B-parameter model by 20.0% (which scores 0.624).
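The control flow described above can be pictured as a decoding loop in which certain generated tokens trigger extra vision passes. The sketch below is a hypothetical illustration under our own assumptions, not the authors' implementation: the special-token names, the box format following the Region Selection Token, and the toy encoder are all illustrative.

```python
# Hypothetical sketch of the perception-token control flow, NOT the authors' code.
# Token names, tensor shapes, and the toy encoder are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

REGION_SELECT = "<region_select>"      # assumed name for the Region Selection Token
VISION_REENCODE = "<vision_reencode>"  # assumed name for the Vision Re-Encoding Token


class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT-style encoder mapping an image (crop) to visual embeddings."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)

    def forward(self, image, condition=None):
        # `condition` plays the role of the Vision Re-Encoding Token's hidden state
        # guiding the extra encoding pass (an assumption about the mechanism).
        feat = self.proj(image.flatten(1))
        if condition is not None:
            feat = feat + condition
        return feat


def handle_perception_tokens(generated, image, encoder, hidden_states):
    """Scan generated tokens; each perception token triggers an extra vision pass
    whose output would be appended to the model's context for further decoding."""
    extra_visual_inputs = []
    for i, tok in enumerate(generated):
        if tok == REGION_SELECT:
            # Assume the model emits a normalized box (x1, y1, x2, y2) after the token.
            x1, y1, x2, y2 = generated[i + 1]
            h, w = image.shape[-2:]
            crop = image[..., int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)]
            crop = F.interpolate(crop, size=(32, 32))
            extra_visual_inputs.append(encoder(crop))
        elif tok == VISION_REENCODE:
            # Use this token's hidden state to condition a re-encoding of the image.
            resized = F.interpolate(image, size=(32, 32))
            extra_visual_inputs.append(encoder(resized, condition=hidden_states[i]))
    return extra_visual_inputs


if __name__ == "__main__":
    image = torch.rand(1, 3, 224, 224)
    encoder = ToyVisionEncoder()
    generated = ["The", "answer", REGION_SELECT, (0.1, 0.1, 0.6, 0.6), VISION_REENCODE]
    hidden_states = torch.rand(len(generated), 64)
    extras = handle_perception_tokens(generated, image, encoder, hidden_states)
    print(len(extras), "additional visual inputs produced")
```

In this reading, the Region Selection Token explicitly names a region to crop and re-encode, whereas the Vision Re-Encoding Token conditions a second encoding pass through its hidden state rather than through explicit coordinates.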