

Poster

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao · Zhengyuan Yang · Linjie Li · Dianqi Li · Kevin Lin · Yu Cheng · Lijuan Wang


Abstract:

In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process, called ImageGen-CoT, prior to image generation. However, we observe that MLLMs often produce unstructured reasoning steps, resulting in suboptimal outcomes. To tackle this issue, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset and then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. Because T2I-ICL tasks remain complex, there is still significant room for improvement, so we further explore test-time scale-up strategies and propose a novel hybrid scaling approach: it first generates multiple ImageGen-CoT chains and then produces multiple images for each chain by varying the random seed. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks.
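The hybrid test-time scaling described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the `mllm` interface (`generate_cot`, `generate_image`), the `score_fn` selector, and the parameter names are hypothetical placeholders chosen for clarity, not the authors' released API.

```python
import random

def hybrid_test_time_scaling(mllm, prompt, context_examples, score_fn,
                             num_chains=4, images_per_chain=4):
    """Sketch of the hybrid scaling idea: sample several ImageGen-CoT chains,
    then several images per chain by varying the random seed, and keep the
    best-scoring candidate. All callables here are assumed interfaces."""
    candidates = []
    for _ in range(num_chains):
        # Sample one ImageGen-CoT reasoning chain via stochastic decoding.
        cot = mllm.generate_cot(prompt, context_examples)
        for _ in range(images_per_chain):
            # Vary the generation seed to diversify images for this chain.
            seed = random.randint(0, 2**31 - 1)
            image = mllm.generate_image(prompt, cot, seed=seed)
            candidates.append((cot, seed, image))
    # Select the candidate preferred by an external scorer (e.g., an
    # image-text similarity or reward model -- an assumption, not specified here).
    return max(candidates, key=lambda c: score_fn(c[2], prompt, context_examples))
```

The sketch shows the two axes of scaling the abstract names (more reasoning chains, more seeds per chain); how candidates are actually scored and selected is not detailed in the abstract.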
