

Poster

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

Tiancheng SHEN · Jun Hao Liew · Zilong Huang · Xiangtai Li · Zhijie Lin · Jiyang Liu · Yitong Wang · Jiashi Feng · Ming-Hsuan Yang


Abstract:

Multimodal Diffusion Transformers (MM-DiTs) have recently emerged as a powerful framework for unified text-vision synthesis, surpassing traditional U-Net architectures in generative tasks. A key innovation is the Multimodal Self-Attention (MM-SA) interaction, in which image and text tokens are concatenated and processed jointly via self-attention. However, this mechanism poses significant challenges for editing, rendering conventional U-Net-based attention manipulation methods ineffective. To address this limitation, we propose QK-Edit, a training-free framework that exploits the unique attention dynamics of MM-DiTs for precise text-guided image and video editing. By introducing a novel query-key manipulation strategy, our method isolates and adjusts critical attention components to achieve an optimal balance between prompt fidelity and structural consistency. This enables seamless edits across a wide range of tasks, including object addition, object removal, object replacement, background change, material change, color change, and style transformation. Notably, the method can be implemented simply via feature replacement at inference time. QK-Edit demonstrates superior editing performance on state-of-the-art models such as FLUX and HunyuanVideo, effectively bridging the gap between generative power and editing flexibility in MM-DiTs and paving the way for scalable multimodal content creation. The code will be made publicly available.
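The abstract does not give implementation details, but the stated idea of feature replacement at inference (injecting query and key features from a source pass into the MM-SA blocks of an edit pass, while keeping the values of the edit pass) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function mm_self_attention, the cache and override dictionaries, and the toy tensor sizes are hypothetical and are not the authors' released code.

import torch
import torch.nn.functional as F

def mm_self_attention(x, w_q, w_k, w_v, num_heads, qk_override=None, cache=None):
    """Self-attention over concatenated [text; image] tokens.

    If qk_override holds (q, k) tensors cached from a source-prompt pass,
    they replace the freshly computed queries/keys; values always come
    from the current (edit-prompt) pass.
    """
    B, N, C = x.shape
    head_dim = C // num_heads

    q = (x @ w_q).view(B, N, num_heads, head_dim).transpose(1, 2)
    k = (x @ w_k).view(B, N, num_heads, head_dim).transpose(1, 2)
    v = (x @ w_v).view(B, N, num_heads, head_dim).transpose(1, 2)

    if cache is not None:        # source pass: remember Q/K for later injection
        cache["q"], cache["k"] = q.detach(), k.detach()
    if qk_override is not None:  # edit pass: inject cached Q/K
        q, k = qk_override["q"], qk_override["k"]

    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(B, N, C)

# Toy usage: one attention block, hidden size 64, 4 heads.
torch.manual_seed(0)
B, N, C, H = 1, 16, 64, 4
w_q, w_k, w_v = (torch.randn(C, C) * 0.02 for _ in range(3))

x_source = torch.randn(B, N, C)  # tokens from the source-prompt pass
x_edit = torch.randn(B, N, C)    # tokens from the edit-prompt pass

cache = {}
_ = mm_self_attention(x_source, w_q, w_k, w_v, H, cache=cache)            # cache Q/K
edited = mm_self_attention(x_edit, w_q, w_k, w_v, H, qk_override=cache)   # inject Q/K
print(edited.shape)  # torch.Size([1, 16, 64])

In an actual model such as FLUX or HunyuanVideo, this kind of injection would presumably be applied only in selected MM-DiT blocks and denoising steps, with the source pass reconstructing the original image or video and the edit pass driven by the target prompt.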
