Poster
LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
Jiahao Wang · Ning Kang · Lewei Yao · Mengzhao Chen · Chengyue Wu · Songyang Zhang · Shuchen Xue · Yong Liu · Taiqiang Wu · Xihui Liu · Kaipeng Zhang · Shifeng Zhang · Wenqi Shao · Zhenguo Li · Ping Luo
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include five practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to guide the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient alternative baseline to DiT with pure linear attention. On class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize to text-to-image generation: LiT can be swiftly converted from PixArt-Σ to generate high-quality images while maintaining comparable GenEval scores. Additionally, LiT supports offline deployment on a laptop, enabling photorealistic image generation at 1K resolution.
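
As an illustration of guidelines 1 and 2, the following is a minimal PyTorch-style sketch of a linear attention block with a depth-wise convolution branch and a small number of heads. The class name, argument names, ReLU feature map, and 3x3 depth-wise kernel are assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLinearAttention(nn.Module):
    """Illustrative linear attention with a depth-wise conv branch and few heads."""

    def __init__(self, dim, num_heads=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads          # fewer heads: larger per-head dim at similar latency
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        # depth-wise 3x3 convolution adds local detail on top of global linear attention
        self.dwc = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w spatial tokens
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)  # (B, H, N, d)

        q, k, v = map(split, (q, k, v))
        q = F.relu(q) + 1e-6    # non-negative kernel feature map
        k = F.relu(k) + 1e-6

        # linear attention: contract k with v first -> O(N * d^2) instead of O(N^2 * d)
        kv = torch.einsum('bhnd,bhne->bhde', k, v)                       # (B, H, d, d)
        z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)              # (B, H, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)

        # depth-wise convolution on values, applied in the 2D token layout
        v2d = v.transpose(1, 2).reshape(B, N, C).transpose(1, 2).reshape(B, C, h, w)
        out = out + self.dwc(v2d).flatten(2).transpose(1, 2)
        return self.proj(out)

A quick shape check with hypothetical sizes: SimpleLinearAttention(dim=384, num_heads=2) applied to torch.randn(2, 256, 384) with h = w = 16 returns a (2, 256, 384) tensor.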
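
Guidelines 3 and 4 amount to a selective state-dict transfer from the pre-trained DiT. The sketch below assumes PyTorch module conventions; filtering on the substring 'attn' is only an assumed naming pattern for the attention layers.

def load_from_pretrained_dit(linear_dit, dit_state_dict):
    # Initialize the linear DiT from a fully converged pre-trained DiT (guideline 3),
    # loading every parameter except those belonging to the attention layers,
    # which are replaced by freshly initialized linear attention (guideline 4).
    filtered = {k: v for k, v in dit_state_dict.items() if 'attn' not in k}
    missing, unexpected = linear_dit.load_state_dict(filtered, strict=False)
    return missing, unexpected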
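
Guideline 5 can be read as a two-term distillation objective. The sketch below assumes DiT-style models that jointly predict noise and (log-)variance along the channel dimension; the function signature and loss weights are illustrative, and in practice this term would presumably be combined with the standard diffusion training loss.

import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student, teacher, x_t, t, y,
                             lambda_noise=1.0, lambda_var=1.0):
    # Hybrid knowledge distillation: a frozen pre-trained teacher DiT supervises
    # both the predicted noise and the predicted variance of the reverse
    # diffusion process for the student linear DiT.
    with torch.no_grad():
        eps_t, var_t = teacher(x_t, t, y).chunk(2, dim=1)   # teacher noise / variance
    eps_s, var_s = student(x_t, t, y).chunk(2, dim=1)       # student noise / variance

    loss_noise = F.mse_loss(eps_s, eps_t)   # distill predicted noise
    loss_var = F.mse_loss(var_s, var_t)     # distill predicted variance
    return lambda_noise * loss_noise + lambda_var * loss_var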