Poster
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang · Chengqi Duan · Kun Wang · Hao Li · Linjiang Huang · Hao Tian · Xingyu Zeng · Rui Zhao · Jifeng Dai · Hongsheng Li · Xihui Liu
Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models for visual content generation. However, existing approaches face a trade-off between generation diversity and controllability, struggling to meet the varying granularity demands of different image generation tasks within a unified MLLM framework. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. PUMA achieves this by unifying multi-granular visual features as both inputs and outputs of MLLMs, thus effectively meeting the distinct granularity needs for diverse generation and precise manipulation within a single framework. Following multimodal pretraining and instruction tuning, PUMA demonstrates remarkable capabilities in a wide range of multimodal tasks, including image understanding, diverse text-to-image generation, editing, inpainting, colorization, and conditional generation. This work marks a significant stride towards realizing truly unified MLLMs capable of seamlessly adapting to the diverse granularity demands and task requirements inherent in various visual tasks. The code and model will be released upon acceptance.
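The abstract states that multi-granular visual features serve as both inputs and outputs of the MLLM, but the concrete architecture is only available in the forthcoming code release. Purely as a hypothetical sketch of the input side of such a design (not the authors' implementation), one could imagine pooling a dense image-feature grid to several granularities and projecting each level into the language model's token space; the dimensions, pooling scales, and module names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiGranularAdapter(nn.Module):
    """Toy sketch: pool a dense image-feature grid to several granularities
    and project each level into the language model's embedding space.
    All sizes here are assumed for illustration only."""
    def __init__(self, vis_dim=1024, llm_dim=4096, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales  # tokens kept per side at each granularity level
        self.proj = nn.ModuleList(nn.Linear(vis_dim, llm_dim) for _ in scales)

    def forward(self, feat_grid):
        # feat_grid: (B, H, W, C) dense features from a frozen image encoder
        x = feat_grid.permute(0, 3, 1, 2)                       # (B, C, H, W)
        levels = []
        for s, proj in zip(self.scales, self.proj):
            pooled = nn.functional.adaptive_avg_pool2d(x, s)    # (B, C, s, s)
            tokens = pooled.flatten(2).transpose(1, 2)          # (B, s*s, C)
            levels.append(proj(tokens))                         # (B, s*s, llm_dim)
        # concatenate coarse-to-fine visual tokens for the MLLM input sequence
        return torch.cat(levels, dim=1)

adapter = MultiGranularAdapter()
dummy = torch.randn(2, 16, 16, 1024)   # stand-in for encoder output
print(adapter(dummy).shape)            # (2, 273, 4096): 1 + 16 + 256 tokens
```

In such a scheme, coarse levels would support diverse generation while fine levels preserve detail for precise manipulation; how PUMA actually realizes and decodes these granularities should be checked against the released code and paper.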