

Poster

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang · Chengqi Duan · Kun Wang · Hao Li · Linjiang Huang · Hao Tian · Xingyu Zeng · Rui Zhao · Jifeng Dai · Hongsheng Li · Xihui Liu


Abstract:

Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models for visual content generation. However, existing approaches face a trade-off between generation diversity and controllability, struggling to meet the varying granularity demands of different image generation tasks within a unified MLLM framework. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. PUMA achieves this by unifying multi-granular visual features as both inputs and outputs of MLLMs, thus effectively meeting the distinct granularity needs for diverse generation and precise manipulation within a single framework. Following multimodal pretraining and instruction tuning, PUMA demonstrates remarkable capabilities in a wide range of multimodal tasks, including image understanding, diverse text-to-image generation, editing, inpainting, colorization, and conditional generation. This work marks a significant stride towards realizing truly unified MLLMs capable of seamlessly adapting to the diverse granularity demands and task requirements inherent in various visual tasks. The code and model will be released upon acceptance.
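The core idea, representing one image at several granularities and letting a single backbone consume and emit all of them, can be illustrated with a minimal sketch. The code below is a hypothetical toy, not the authors' released implementation: the module names, feature dimensions, pooling scheme, and the three granularity levels are all assumptions chosen for clarity. Coarse tokens would feed a diversity-oriented decoder, fine tokens a reconstruction-oriented one.

```python
# Hypothetical sketch of multi-granular visual tokens in a unified backbone.
# Not the paper's code: sizes, levels, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class MultiGranularEncoder(nn.Module):
    """Pools a dense visual feature map into several coarse-to-fine token sets."""

    def __init__(self, dim=128, grid=16, levels=(1, 4, 16)):
        super().__init__()
        self.grid = grid
        self.levels = levels  # tokens per side at each granularity: 1x1, 4x4, 16x16
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat):  # feat: (B, grid*grid, dim) dense patch features
        b, _, d = feat.shape
        fmap = feat.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = []
        for s in self.levels:
            pooled = nn.functional.adaptive_avg_pool2d(fmap, s)              # (B, d, s, s)
            tokens.append(self.proj(pooled.flatten(2).transpose(1, 2)))      # (B, s*s, d)
        return tokens  # list of coarse-to-fine visual token sets


class UnifiedBackbone(nn.Module):
    """One transformer takes text + multi-granular visual tokens and predicts
    visual features at every granularity for downstream decoders."""

    def __init__(self, dim=128, levels=(1, 4, 16)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        # one lightweight head per granularity; image decoders would sit on top
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in levels)
        self.levels = levels

    def forward(self, text_tokens, visual_tokens):
        seq = torch.cat([text_tokens] + visual_tokens, dim=1)
        hidden = self.blocks(seq)
        # slice the positions of each granularity and predict its features
        outputs, offset = [], text_tokens.shape[1]
        for head, s in zip(self.heads, self.levels):
            n = s * s
            outputs.append(head(hidden[:, offset:offset + n]))
            offset += n
        return outputs  # coarse -> diverse generation; fine -> precise manipulation


if __name__ == "__main__":
    B, dim, grid = 2, 128, 16
    dense = torch.randn(B, grid * grid, dim)   # stand-in for an image encoder's output
    text = torch.randn(B, 8, dim)              # stand-in for embedded text prompt
    preds = UnifiedBackbone(dim)(text, MultiGranularEncoder(dim, grid)(dense))
    print([p.shape for p in preds])            # [(2,1,128), (2,16,128), (2,256,128)]
```

In this toy setup, the same sequence model handles understanding (text conditioned on visual tokens) and generation (visual tokens conditioned on text), which is the unification the abstract describes; the actual architecture, training objectives, and decoders are specified in the paper.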
