Poster

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin · Wei Liu · Chen Chen · Jiasen Lu · Wenze Hu · Tsu-Jui Fu · Jesse Allardice · Zhengfeng Lai · Liangchen Song · Bowen Zhang · cha chen · Yiran Fei · Lezhi Li · Yizhou Sun · Kai-Wei Chang · Yinfei Yang


Abstract:

We present a simple and scalable text- and image-conditioned video generation method. Our approach, named STIV, integrates a variable number of image conditions into a Diffusion Transformer (DiT) through frame replacement. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, as well as long video generation through autoregressive rollouts. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, and multi-view generation. Through comprehensive ablation studies on T2I, T2V, TI2V, and long video generation, STIV demonstrates strong performance despite its simple design. An 8.7B model at 512×512 resolution achieves 83.1 on VBench T2V, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512×512 resolution. Combining all of these, we finally scale our model to 540p with over 200 frames. By providing a transparent recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress in video generation.
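The frame-replacement conditioning described above can be sketched in a few lines. This is a minimal, hedged illustration of the general idea (names, shapes, and the function itself are illustrative assumptions, not the authors' code): before a denoising step, the latents of the conditioning frames are overwritten with their clean encodings, so the model must generate the remaining frames consistently with them. With an empty condition mask the same code degenerates to plain T2V; conditioning on the last generated frame of a previous clip yields the autoregressive rollout for long videos.

```python
import numpy as np

def frame_replacement(noisy_latents, clean_latents, cond_mask):
    """Overwrite conditioning-frame latents with their clean encodings.

    noisy_latents: (T, C, H, W) video latents at the current diffusion step.
    clean_latents: (T, C, H, W) clean (noise-free) frame encodings.
    cond_mask:     (T,) boolean; True where a frame is given as a condition.
    """
    out = noisy_latents.copy()
    out[cond_mask] = clean_latents[cond_mask]  # replace conditioned frames
    return out

# Example: TI2V-style conditioning on the first frame of a 4-frame clip.
T, C, H, W = 4, 2, 3, 3
noisy = np.random.randn(T, C, H, W)
clean = np.zeros((T, C, H, W))            # stand-in for encoded image frames
mask = np.array([True, False, False, False])
mixed = frame_replacement(noisy, clean, mask)
```

Because the number of `True` entries in the mask is arbitrary, one model can cover first-frame conditioning (TI2V), first-and-last-frame conditioning (interpolation), or multi-frame conditioning (video prediction) with the same mechanism.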
