Poster
FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention
Xuan Ju · Weicai Ye · Quande Liu · Qiulin Wang · Xintao Wang · Pengfei Wan · Di Zhang · Kun Gai · Qiang Xu
Current video generative foundation models primarily focus on text-to-video tasks and offer limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they suffer from three key limitations: branch conflicts between independently trained adapters, parameter redundancy that increases computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions, including text, camera, identities, and depth, via full-attention mechanisms. By directly fusing multimodal conditions into a unified sequence representation, FullDiT significantly reduces parameter overhead, avoids the conflicts common in adapter-based methods, and exhibits scalability and emergent capabilities. We further introduce FullBench, a new benchmark designed specifically for evaluating multi-condition video generation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of unified full attention in complex multimodal video tasks.
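The core architectural idea, fusing condition tokens and video tokens into a single sequence processed by full self-attention rather than attaching per-condition adapter branches, can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, projection scheme, dimensions, and the use of a vanilla transformer encoder are all assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the released FullDiT code): project each
# condition modality into a shared token space, concatenate with the noisy
# video latent tokens along the sequence axis, and run full (unmasked)
# self-attention over the unified sequence.
import torch
import torch.nn as nn


class UnifiedFullAttentionBackbone(nn.Module):
    """Toy unified-sequence backbone: all modalities share one attention map."""

    def __init__(self, dim=512, heads=8, layers=4,
                 text_dim=768, camera_dim=12, id_dim=512, depth_dim=256):
        super().__init__()
        # Per-modality linear projections into the shared token width (assumed).
        self.proj = nn.ModuleDict({
            "video":  nn.Linear(dim, dim),        # noisy video latent tokens
            "text":   nn.Linear(text_dim, dim),
            "camera": nn.Linear(camera_dim, dim),
            "id":     nn.Linear(id_dim, dim),
            "depth":  nn.Linear(depth_dim, dim),
        })
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, video_tokens, conditions):
        # conditions: dict of modality name -> (B, L_m, C_m) token tensors.
        # Absent modalities are simply left out of the sequence, so no
        # separate adapter branch (and no branch conflict) is needed.
        parts = [self.proj["video"](video_tokens)]
        for name, tokens in conditions.items():
            parts.append(self.proj[name](tokens))
        fused = torch.cat(parts, dim=1)                  # (B, sum(L), dim)
        out = self.backbone(fused)                       # full attention
        return out[:, : video_tokens.shape[1]]           # keep the video part


# Usage: 2 clips of 16 latent tokens, conditioned on text and camera poses.
model = UnifiedFullAttentionBackbone()
video = torch.randn(2, 16, 512)
conds = {"text": torch.randn(2, 20, 768), "camera": torch.randn(2, 16, 12)}
print(model(video, conds).shape)  # torch.Size([2, 16, 512])
```

Because every condition lives in the same sequence, cross-modal interactions are handled by the same attention weights that model the video itself, which is what lets such a design drop independently trained control branches and their extra parameters.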