

Poster

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin · Le Zhuo · Yi Xin · Ruoyi Du · Zhen Li · Bin Fu · Yiting Lu · Xinyue Li · Dongyang Liu · Xiangyang Zhu · Will Beddow · Erwann Millon · Victor Perez · Wenhai Wang · Yu Qiao · Bo Zhang · Xiaohong Liu · Hongsheng Li · Chang Xu · Peng Gao


Abstract:

We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. Lumina-Image 2.0 is characterized by two key features: (1) Unification – it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), which generates detailed and accurate multilingual captions for our model. This not only accelerates model convergence but also enhances prompt adherence, multi-granularity prompt handling, and task expansion with customized prompt templates. (2) Efficiency – to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies, alongside inference-time acceleration techniques that do not compromise image quality. We evaluate our model on academic benchmarks and T2I arenas, and the results confirm that it matches or exceeds existing state-of-the-art models across various metrics, highlighting the effectiveness of our methods.
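The core of the unification idea – treating text and image tokens as one joint sequence so that attention mixes modalities by construction – can be sketched with a toy single-head self-attention pass. This is a minimal illustrative sketch, not the actual Unified Next-DiT implementation: the function names, the tiny 4-dimensional embeddings, and the use of the embeddings themselves as queries, keys, and values are all assumptions made for brevity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint_self_attention(text_tokens, image_tokens):
    """Toy single-head self-attention over the concatenated
    text+image sequence (here Q = K = V = raw token embeddings).
    Because both modalities live in one sequence, every text token
    attends to every image token and vice versa."""
    seq = text_tokens + image_tokens  # the joint cross-modal sequence
    d = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product scores of this query against all keys.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        # Output is the attention-weighted mix of all value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, seq))
                    for j in range(d)])
    return out

# Two 4-dim text tokens and three 4-dim image tokens attend jointly.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.0]]
mixed = joint_self_attention(text, image)
```

Each output row is a convex combination of all five input tokens, which is exactly the "natural cross-modal interaction" the unified sequence provides without any dedicated cross-attention blocks.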
