Poster
Instruction-based Image Editing with Planning, Reasoning, and Generation
Liya Ji · Chenyang Qi · Qifeng Chen
Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it demands a deeper level of scene understanding and generation. Prior work chains large language models, object segmentation models, and editing models for this task. However, these understanding models operate on only a single modality, which limits editing quality. We aim to bridge understanding and generation with a new multi-modality model that equips instruction-based image editing with the reasoning needed for more complex cases. To achieve this goal, we decompose the instruction-based editing task into multi-modality chain-of-thought stages: Chain-of-Thought (CoT) planning, editing-region reasoning, and editing. For CoT planning, a large language model reasons out the appropriate sub-prompts given the provided instruction and the capabilities of the editing network. For editing-region reasoning, we train an instruction-based editing-region generation network with a multi-modal large language model. Finally, for edited image generation, we propose a hint-guided instruction-based editing network built on a large text-to-image diffusion model that accepts these hints as guidance. Extensive experiments demonstrate that our method achieves competitive editing ability on complex real-world images. Source code will be publicly available.
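The abstract describes a three-stage pipeline (CoT planning, editing-region reasoning, hint-guided generation). The minimal Python sketch below only illustrates how such a pipeline might be wired together; all function names (plan_subprompts, reason_edit_region, hint_guided_edit) are hypothetical placeholders and the model calls are stubbed, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the three-stage editing pipeline described above.
# Every function here is a stub standing in for a model the paper describes.

from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class EditPlan:
    sub_prompts: List[str]  # per-step instructions produced by CoT planning


def plan_subprompts(instruction: str) -> EditPlan:
    """Stage 1 (CoT planning): an LLM would decompose the user instruction
    into sub-prompts the editing network can handle; stubbed here."""
    return EditPlan(sub_prompts=[instruction])


def reason_edit_region(image: Any, sub_prompt: str) -> Optional[Any]:
    """Stage 2 (editing-region reasoning): a multi-modal LLM-based network
    would predict the region (e.g., a mask) to edit; stubbed as None,
    meaning the whole image."""
    return None


def hint_guided_edit(image: Any, sub_prompt: str, region_hint: Optional[Any]) -> Any:
    """Stage 3 (generation): a hint-guided diffusion editing network would
    apply the sub-prompt within the hinted region; identity here."""
    return image


def edit_image(image: Any, instruction: str) -> Any:
    """Run the full pipeline: plan, then reason and edit per sub-prompt."""
    plan = plan_subprompts(instruction)
    for sub_prompt in plan.sub_prompts:
        region_hint = reason_edit_region(image, sub_prompt)
        image = hint_guided_edit(image, sub_prompt, region_hint)
    return image
```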