Poster
LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
Yu Cheng · Fajie Yuan
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly when encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly its efficiency gains over existing Video VAEs. Our model offers up to 50× fewer FLOPs and 44× faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code will be made publicly available.
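The two ingredients named above, non-overlapping patch operations and wavelet transforms, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the patch sizes, tensor layout, and single-level 2D Haar transform below are illustrative assumptions, chosen only to show how both operations shrink spatial resolution while folding information into channels:

```python
import numpy as np

def patchify(video, pt=1, ps=4):
    """Split a (T, H, W, C) video into non-overlapping pt x ps x ps patches.
    Illustrative only: patch sizes are assumptions, not LeanVAE's settings."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ps == 0 and W % ps == 0
    x = video.reshape(T // pt, pt, H // ps, ps, W // ps, ps, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the patch dims together
    # Each patch is flattened into the channel dimension.
    return x.reshape(T // pt, H // ps, W // ps, pt * ps * ps * C)

def haar2d(frame):
    """Single-level orthonormal 2D Haar transform of an (H, W, C) frame.
    Returns the four subbands (LL, LH, HL, HH) stacked along channels."""
    a, b = frame[0::2, 0::2], frame[0::2, 1::2]
    c, d = frame[1::2, 0::2], frame[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return np.concatenate([ll, lh, hl, hh], axis=-1)

video = np.random.rand(4, 32, 32, 3)
tokens = patchify(video)       # (4, 8, 8, 48): 16x fewer spatial positions
bands = haar2d(video[0])       # (16, 16, 12): halved resolution, 4 subbands
print(tokens.shape, bands.shape)
```

Both operations are cheap, invertible rearrangements (the Haar transform here is orthonormal, so the original frame is exactly recoverable from the subbands), which is why they can reduce the resolution seen by the heavy network layers without discarding information.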