Poster
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Zhimin Liao · Ping Wei · Ruijie Zhang · Shuaijia Chen · Haoxuan Wang · Ziyang Ren
Abstract:
Forecasting the evolution of 3D scenes and generating unseen scenarios through occupancy-based world models offers substantial potential to enhance the safety of autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design retains the compactness of 3D tokenizers while capturing the dynamic expressiveness of 4D approaches. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to guide future scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, surpassing existing approaches by $\textbf{41.8}$% in 4D occupancy forecasting. It is also exceptionally efficient, requiring only $\textbf{2.9 GB}$ of training memory and achieving real-time inference at $\textbf{94.8 FPS}$.
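The central mechanism the abstract names, multi-scale residual quantization, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of the general technique, not the authors' released implementation: the module name `MultiScaleRQ`, the scale schedule, the codebook size, and the use of a 2D BEV feature map (rather than a full 3D volume) are all assumptions made for brevity.

```python
# Minimal sketch of multi-scale residual quantization (assumed names/shapes).
# For simplicity this operates on a 2D BEV feature map; the same residual loop
# would apply to 3D volumes with trilinear interpolation. The straight-through
# gradient trick used to train such quantizers is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRQ(nn.Module):
    def __init__(self, dim=64, codebook_size=512, scales=(4, 8, 16)):
        super().__init__()
        self.scales = scales  # coarse-to-fine grid resolutions (assumed values)
        self.codebook = nn.Embedding(codebook_size, dim)

    def quantize(self, z):
        # Nearest-neighbor codebook lookup: (B, H, W, C) features -> indices.
        d = torch.cdist(z.flatten(0, 2), self.codebook.weight)  # (B*H*W, K)
        idx = d.argmin(dim=-1)
        return self.codebook(idx).view_as(z), idx

    def forward(self, feat):
        # feat: (B, C, H, W) scene feature (e.g., a BEV projection of the scene).
        residual = feat
        recon = torch.zeros_like(feat)
        tokens = []
        for s in self.scales:
            # Quantize a downsampled view of the current residual...
            low = F.interpolate(residual, size=(s, s), mode="bilinear",
                                align_corners=False)
            q, idx = self.quantize(low.permute(0, 2, 3, 1))
            q = q.permute(0, 3, 1, 2)
            # ...then upsample the codes and subtract, so each finer scale
            # only has to encode what the coarser scales missed.
            up = F.interpolate(q, size=feat.shape[-2:], mode="bilinear",
                               align_corners=False)
            residual = residual - up
            recon = recon + up
            tokens.append(idx)
        return recon, tokens  # hierarchical tokens: few coarse + many fine
```

Under this scheme, each coarser scale contributes only a handful of tokens and each finer scale encodes just the remaining residual, which is how a hierarchical tokenizer can compress a large scene into a compact discrete representation while still preserving spatial detail.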