Poster
Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Yifei Xia · Suhan Ling · Fangcheng Fu · Yujie Wang · Huixia Li · Xuefeng Xiao · Bin Cui
Abstract:
Generating high-quality long videos with Diffusion Transformers (DiTs) suffers from significant latency due to the computationally intensive attention mechanism. For instance, generating an 8s 720p video (110K tokens) with HunyuanVideo requires around 600 PFLOPs, of which attention computations consume about 500 PFLOPs. To tackle this, we propose **AdaSpa**, the first **Dynamic Pattern** and **Online Precise Search** sparse attention method for DiTs. First, motivated by our observation that the sparsity in DiTs exhibits hierarchical and blockified structures across modalities, AdaSpa uses a blockified pattern to efficiently represent this sparsity, significantly reducing attention complexity while preserving video fidelity. Second, AdaSpa introduces Fused LSE-Cached Search with Head-Adaptive Block Sparse Attention for efficient online precise search and computation. This approach exploits the invariance of sparse patterns and LSE values across denoising steps, enabling precise real-time identification of sparse patterns with minimal overhead. AdaSpa is an **adaptive, plug-and-play solution** that integrates seamlessly into existing DiT models without additional training or data profiling. Extensive experiments validate that AdaSpa accelerates video generation by 1.59$\times$ to 2.04$\times$ while maintaining video quality, demonstrating strong effectiveness.
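The core idea of blockified sparse attention can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: it mean-pools queries and keys into block representatives, scores key blocks per query block, keeps only the top fraction of blocks (a stand-in for the paper's online precise search), and then evaluates attention only on the retained tiles. Function names, the pooling choice, and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def blockified_topk_mask(q, k, block=4, keep_ratio=0.5):
    """Illustrative block-pattern search: estimate per-block attention
    importance and keep the top fraction of key blocks per query block."""
    T, d = q.shape
    nb = T // block
    # Mean-pool tokens into one representative vector per block.
    qb = q[: nb * block].reshape(nb, block, d).mean(axis=1)
    kb = k[: nb * block].reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T / np.sqrt(d)                  # (nb, nb) block scores
    keep = max(1, int(np.ceil(keep_ratio * nb)))
    top = np.argsort(-scores, axis=1)[:, :keep]      # strongest key blocks
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask  # True = compute this (query block, key block) tile

def block_sparse_attention(q, k, v, mask, block=4):
    """Dense reference of block-sparse attention: tiles where mask is
    False are skipped (logits set to -inf before the softmax)."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    # Expand the block-level mask back to token resolution.
    full = np.repeat(np.repeat(mask, block, axis=0), block, axis=1)
    logits = np.where(full[:T, :T], logits, -np.inf)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

With `keep_ratio=1.0` every tile is kept and the result matches dense attention; smaller ratios trade fidelity for fewer computed tiles. A per-head variant (as the head-adaptive design suggests) would simply run the search independently for each attention head.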