Poster
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang · Wenliang Zheng · Aashrith Madasu · Peng Shi · Ryo Kamoi · Hao Zhou · Zhuoyang Zou · Shu Zhao · Sarkar Snigdha Sarathi Das · Vipul Gupta · Xiaoxin Lu · Nan Zhang · Ranran Zhang · Avitej Iyer · Renze Lou · Wenpeng Yin · Rui Zhang
Abstract:
High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) typically handle higher-resolution images through dynamic patching. However, there is no comprehensive benchmark for evaluating HRI understanding in VLMs, leaving this domain underexplored. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators and covers 25 scenarios, ranging from microscopic and radiology images to street views, long-range pictures, and telescope images. It includes high-resolution images of real-world objects, scanned documents, and composite multi-image scenes. The two diagnostic evaluation datasets are synthesized by combining the target image containing the gold answer with similar distracting images in different orders. These datasets assess how well models utilize HRI by comparing performance across different image regions. We conduct extensive experiments involving 27 VLMs, including Gemini 2.0 Pro and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50\% on real-world tasks, revealing significant gaps in HRI understanding. Results on our synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions compared to low-resolution images, with a gap exceeding 20\%. Our code and data will be publicly available.
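
The abstract describes the diagnostic datasets as being built by placing a target image (containing the gold answer) among similar distractor images in different orders. The snippet below is a minimal sketch of one way such a diagnostic image could be composed; it is not the authors' released code, and the function name, grid layout, and parameters are assumptions for illustration only.

```python
# Minimal sketch (assumed layout): tile one target image and visually similar
# distractors into an N x N grid, with the target at a chosen cell. Sweeping
# `target_cell` over all positions lets one compare model accuracy across
# different regions of the high-resolution image.
from PIL import Image

def compose_diagnostic_image(target, distractors, grid=3, target_cell=(0, 0),
                             cell_size=(1024, 1024)):
    """Return a single high-resolution image: the target at `target_cell`
    (row, col), all other cells filled with distractor images."""
    w, h = cell_size
    canvas = Image.new("RGB", (w * grid, h * grid))
    pool = iter(distractors)
    for row in range(grid):
        for col in range(grid):
            tile = target if (row, col) == target_cell else next(pool)
            canvas.paste(tile.resize(cell_size), (col * w, row * h))
    return canvas

# Hypothetical usage: probe regional bias by placing the same target at every cell.
# target_img = Image.open("target.png")
# distractor_imgs = [Image.open(p) for p in distractor_paths]  # >= grid*grid - 1
# for r in range(3):
#     for c in range(3):
#         compose_diagnostic_image(target_img, distractor_imgs,
#                                  target_cell=(r, c)).save(f"diag_r{r}_c{c}.png")
```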