

Poster

UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation

Songhua Liu · Ruonan Yu · Xinchao Wang


Abstract:

Given a source image, personalized text-to-image generation produces images that preserve its identity and appearance while following text prompts. Existing methods rely heavily on test-time optimization to achieve this customization. Although some recent works are dedicated to zero-shot personalization, they still require re-training when applied to different text-to-image diffusion models. In this paper, we instead propose a model-agnostic personalization method termed UniversalBooth. At the heart of our approach lies a novel cross-attention mechanism in which different blocks at the same diffusion scale share common square mappings for key and value, which decouples the image feature encoder from the diffusion architecture while maintaining its effectiveness. Moreover, the cross-attention operates hierarchically: holistic attention first captures the global semantics of the user input for textual combination with editing prompts, and fine-grained attention divides the holistic attention scores among local patches to enhance appearance consistency. To improve performance when deployed on unseen diffusion models, we further impose an optimal-transport prior on the model and encourage the attention scores allocated by cross-attention to satisfy the optimal-transport constraint. Experiments demonstrate that our personalized generation model generalizes to unseen text-to-image diffusion models with a wide spectrum of architectures and functionalities without any additional optimization, which other methods cannot do. Meanwhile, it achieves zero-shot personalization performance on seen architectures comparable to existing works.
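To make the two mechanisms named in the abstract concrete, the sketch below illustrates one plausible reading of them: square (d-to-d) key/value projections shared by all cross-attention blocks at the same diffusion scale, and hierarchical attention in which a holistic score against a global image token is divided among local patch tokens. This is a minimal illustration under our own assumptions, not the authors' implementation; all class and variable names (SharedSquareKV, HierarchicalCrossAttention, global_tok, patch_toks) are hypothetical, and the optimal-transport constraint on the attention scores is not sketched here.

```python
# Hedged sketch of shared square K/V mappings and hierarchical cross-attention.
# Names and the exact score-combination rule are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSquareKV(nn.Module):
    """Square (d -> d) key/value projections shared across all blocks of one scale,
    decoupling the image feature encoder from any particular diffusion backbone."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)


class HierarchicalCrossAttention(nn.Module):
    """Hypothetical hierarchical cross-attention: a holistic score per query is
    computed against a global image token, then distributed over patch tokens."""

    def __init__(self, dim: int, shared_kv: SharedSquareKV):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.shared_kv = shared_kv  # the same instance is reused by every block at this scale
        self.scale = dim ** -0.5

    def forward(self, x, global_tok, patch_toks):
        # x:          (B, N, d) diffusion features acting as queries
        # global_tok: (B, 1, d) pooled embedding of the source image
        # patch_toks: (B, P, d) local patch embeddings of the source image
        q = self.to_q(x)
        k_g = self.shared_kv.to_k(global_tok)   # (B, 1, d)
        k_p = self.shared_kv.to_k(patch_toks)   # (B, P, d)
        v_p = self.shared_kv.to_v(patch_toks)   # (B, P, d)

        # Holistic attention: one gate per query against the global token.
        holistic = torch.sigmoid((q @ k_g.transpose(1, 2)) * self.scale)   # (B, N, 1)

        # Fine-grained attention: divide each holistic score among local patches.
        local = F.softmax((q @ k_p.transpose(1, 2)) * self.scale, dim=-1)  # (B, N, P)
        attn = holistic * local                                            # (B, N, P)
        return attn @ v_p                                                  # (B, N, d)


# Usage: two blocks at the same scale share one SharedSquareKV instance.
shared = SharedSquareKV(dim=320)
blocks = [HierarchicalCrossAttention(320, shared) for _ in range(2)]
x = torch.randn(1, 64, 320)
out = blocks[0](x, torch.randn(1, 1, 320), torch.randn(1, 16, 320))  # (1, 64, 320)
```

Because only the shared square projections touch the source-image features, the same image encoder output can, in this reading, be plugged into cross-attention layers of backbones with different widths per scale without re-training the encoder itself.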
