Poster

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Zeyi Sun · Ziyang Chu · Pan Zhang · Tong Wu · Xiaoyi Dong · Yuhang Zang · Yuanjun Xiong · Dahua Lin · Jiaqi Wang


Abstract:

Recent advances in large language models have enabled task prompting for open-ended text generation. In the vision domain, a longstanding goal is developing models capable of general visual learning, encompassing tasks such as image generation, editing, low-level processing, and dense perception. Although recent efforts have aimed at building vision foundation models that support prompting, significant challenges remain, particularly in accurately comprehending visual prompts and addressing the ambiguity inherent in textual prompts. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed for generalizable visual learning via in-context prompting. X-Prompt can process visual and textual prompts as context, enabling precise task interpretation and accurate execution. A novel prompt-token fusion mechanism effectively extracts relevant task information from complex prompts while significantly reducing the token length. Additionally, a unified training strategy for text and image prediction enhances task awareness, enabling seamless adaptation to open-ended prompts. Extensive experiments demonstrate that X-Prompt effectively interprets in-context prompts and exhibits generalization across both in-domain and out-of-domain visual tasks, paving the way for future advancements in general visual learning.
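The abstract does not detail the prompt-token fusion mechanism, so the sketch below is only a rough illustration of one common way such compression is done: a small set of learnable fusion tokens cross-attends to the long in-context prompt sequence, distilling it into a fixed-length summary before auto-regressive decoding. The module name `PromptTokenFusion` and all hyperparameters here are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of prompt-token fusion, assuming PyTorch.
# K learnable fusion tokens attend to N >> K prompt tokens,
# compressing the in-context examples into a short summary.
import torch
import torch.nn as nn


class PromptTokenFusion(nn.Module):
    """Compress N prompt tokens into K fusion tokens via cross-attention."""

    def __init__(self, dim: int = 1024, num_fusion_tokens: int = 64,
                 num_heads: int = 8):
        super().__init__()
        # Learnable queries that absorb task information from the prompt.
        self.fusion_tokens = nn.Parameter(
            torch.randn(num_fusion_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # prompt_tokens: (batch, n_prompt, dim) -- interleaved visual/text
        # context tokens, typically far longer than num_fusion_tokens.
        b = prompt_tokens.size(0)
        queries = self.fusion_tokens.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(queries, prompt_tokens, prompt_tokens)
        fused = self.norm(fused)
        return fused + self.mlp(fused)  # (batch, num_fusion_tokens, dim)


if __name__ == "__main__":
    fusion = PromptTokenFusion()
    ctx = torch.randn(2, 2048, 1024)  # e.g., tokens from in-context image pairs
    print(fusion(ctx).shape)          # torch.Size([2, 64, 1024])
```

Under this reading, the 2048-token context is reduced to 64 tokens before being prepended to the decoder input, which is consistent with the abstract's claim of extracting task information while significantly reducing token length; the actual design may differ.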
