Poster
DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate
Zhihang Yuan · Rui Xie · Yuzhang Shang · Hanling Zhang · Siyuan Wang · Shengen Yan · Guohao Dai · Yu Wang
Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Motivated by this temporal non-uniformity, we propose DLFR-Gen, a training-free approach for Dynamic Latent Frame Rate Generation in Diffusion Transformers. DLFR-Gen adaptively adjusts the number of elements in the latent space based on the motion frequency of its content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) a dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates to video segments; (2) a latent-space frame merging method that aligns latent representations with their denoised counterparts before merging those that are redundant in low-resolution space; and (3) a preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for capturing semantic and local information. Experiments show that DLFR-Gen achieves a speedup of up to 3x for video generation with minimal quality degradation.
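To make the scheduling idea concrete, below is a minimal PyTorch sketch of motion-adaptive latent frame merging: segments whose latent frames change little over time get their frame count halved by averaging adjacent frames, while high-motion segments are kept at full rate. The function name, segment length, motion metric, and threshold are illustrative assumptions; the paper's actual method additionally aligns latents with their denoised counterparts before merging, which this sketch does not implement.

import torch

def merge_low_motion_frames(latents, segment_len=4, motion_threshold=0.05):
    # latents: (T, C, H, W) latent video frames.
    # Hypothetical scheduling rule: segments whose mean absolute temporal
    # difference falls below the threshold are treated as low-motion and
    # downsampled 2x by averaging adjacent latent frames, halving their
    # token count; other segments are left untouched.
    merged = []
    for start in range(0, latents.shape[0], segment_len):
        seg = latents[start:start + segment_len]
        if seg.shape[0] < 2:
            merged.append(seg)
            continue
        motion = (seg[1:] - seg[:-1]).abs().mean().item()
        if motion < motion_threshold and seg.shape[0] % 2 == 0:
            # Average adjacent frame pairs: (T, C, H, W) -> (T/2, C, H, W).
            seg = seg.view(seg.shape[0] // 2, 2, *seg.shape[1:]).mean(dim=1)
        merged.append(seg)
    return torch.cat(merged, dim=0)

# Example: 16 latent frames, 4 channels, 32x32 spatial grid.
latents = torch.randn(16, 4, 32, 32)
print(merge_low_motion_frames(latents).shape)

In a real pipeline the merged latents would be fed to the DiT with positional embeddings adjusted per segment, which is where the paper's tailored RoPE strategy comes in.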