Poster

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng · Jaskirat Singh · Yunzhong Hou · Zhenchang Xing · Saining Xie · Liang Zheng


Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that jointly training both the VAE and the diffusion model with the standard diffusion loss is ineffective, causing the VAE to converge to trivial solutions and degrading final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the encoder and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over $17\times$ and $45\times$ relative to the REPA and vanilla training recipes, respectively. Interestingly, we observe that once tuned via end-to-end training, the VAE can be reused for downstream generation tasks, exhibiting significantly accelerated generation performance across diverse diffusion architectures and training settings.
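The objective described in the abstract can be sketched as a combined loss: a standard diffusion (noise-prediction) term plus a REPA alignment term that matches VAE encoder features to those of a pretrained vision encoder. This is a minimal numpy sketch, not the authors' implementation; the function names, the loss weight `lam`, and the cosine-similarity form of the alignment term are assumptions. Note the abstract's key point: the diffusion term alone cannot safely tune the VAE, so in practice only the alignment term would backpropagate into the encoder (a stop-gradient detail not representable in plain numpy).

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two feature matrices.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def repa_e_loss(pred_noise, true_noise, enc_feats, target_feats, lam=0.5):
    """Hypothetical combined objective for end-to-end tuning.

    - Diffusion term: MSE between predicted and true noise. Per the abstract,
      this term on its own collapses the VAE, so it should NOT drive the
      encoder's gradients (handled via stop-gradient in a real framework).
    - REPA term: negative mean cosine similarity aligning encoder features
      with features from a pretrained vision encoder; this is the term that
      jointly tunes the VAE encoder. `lam` is an assumed weighting.
    """
    diff_loss = np.mean((pred_noise - true_noise) ** 2)
    repa_loss = -np.mean(cosine_sim(enc_feats, target_feats))
    return diff_loss + lam * repa_loss
```

With perfectly aligned features (cosine similarity 1) and zero noise-prediction error, the sketch returns `-lam`, so lower values indicate better alignment under this sign convention.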
