

Poster

DOGE : Towards Versatile Visual Document Grounding and Referring

Yinan Zhou · Yuxin Chen · Haokun Lin · Yichen Wu · Shuyu Yang · Zhongang Qi · Chen Ma · Li Zhu


Abstract:

With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention as a means of achieving detailed understanding and flexible user interaction. However, these capabilities remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and rEferring data engine (DOGE-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGE-Engine, we construct DOGE-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGE, a strong baseline model that excels in text localization and recognition while precisely grounding and referring to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms. Our code, data, and model will be open-sourced to support community development.
