

Poster

Text-guided Visual Prompt DINO for Generic Segmentation

Yuchen Guan · Chong Sun · Canmiao Fu · Zhipeng Huang · Chun Yuan · Chen Li


Abstract:

Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose \modelName, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text prompts, visual prompts, and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the \textit{\rapLongName (\rapName)} model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5\% compared to conventional approaches. Extensive experiments demonstrate that \modelName achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data and code will be made available.
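To make the early fusion idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes text prompts, visual prompts, and flattened backbone features have already been projected to a shared embedding dimension, and simply concatenates them (with learned modality embeddings) before a joint transformer encoder, so cross-modal attention happens before any query selection. All class and argument names here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EarlyFusionEncoder(nn.Module):
    """Illustrative sketch: fuse text prompt tokens, visual prompt tokens,
    and flattened backbone features at the first encoder stage, so that
    cross-modal interaction occurs before decoding/query selection."""

    def __init__(self, dim=256, num_heads=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Modality embeddings distinguish text, visual-prompt, and image tokens.
        self.modality_embed = nn.Embedding(3, dim)

    def forward(self, text_tokens, visual_prompt_tokens, backbone_features):
        # text_tokens:          (B, T, dim)  e.g. projected text-encoder outputs
        # visual_prompt_tokens: (B, V, dim)  e.g. box/point prompt embeddings
        # backbone_features:    (B, N, dim)  flattened image features
        tokens = torch.cat(
            [text_tokens, visual_prompt_tokens, backbone_features], dim=1)
        modality_ids = torch.cat([
            torch.full((text_tokens.shape[1],), 0),
            torch.full((visual_prompt_tokens.shape[1],), 1),
            torch.full((backbone_features.shape[1],), 2),
        ]).to(tokens.device)
        tokens = tokens + self.modality_embed(modality_ids)
        return self.encoder(tokens)  # jointly encoded multimodal sequence


# Toy usage with random tensors (hypothetical shapes).
enc = EarlyFusionEncoder()
out = enc(torch.randn(2, 4, 256), torch.randn(2, 2, 256), torch.randn(2, 100, 256))
print(out.shape)  # torch.Size([2, 106, 256])
```

The point of the sketch is only the ordering: prompts and image features are mixed inside the encoder itself rather than fused late, after per-modality encoding, which is the limitation the abstract attributes to prior hybrid-prompt pipelines.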
