

Poster

IRASim: A Fine-Grained World Model for Robot Manipulation

Fangqi Zhu · Hongtao Wu · Song Guo · Yuxiao Liu · Chilam Cheang · Tao Kong


Abstract:

World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interaction in the visual space with existing methods, which overlook the precise alignment between each action and its corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses all compared baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller. Video and code are available at https://iccv-2025-13322.github.io/.
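The abstract does not specify how the frame-level action-conditioning module is implemented. One plausible reading, given the diffusion-transformer backbone, is an adaLN-style modulation applied per frame: each frame's tokens are scaled and shifted by parameters regressed from that frame's action, so that action t directly modulates frame t. The sketch below illustrates this interpretation only; the class and parameter names are hypothetical, not taken from the paper or its code.

```python
# Hypothetical illustration of per-frame action conditioning in a
# transformer block (adaLN-style); NOT the authors' implementation.
import torch
import torch.nn as nn

class FrameLevelActionConditioning(nn.Module):
    """Modulate each frame's tokens with scale/shift parameters
    regressed from that frame's action embedding, so that action t
    is explicitly aligned with frame t."""

    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Regress a per-frame (scale, shift) pair from the action.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(action_dim, 2 * hidden_dim),
        )

    def forward(self, x: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, hidden_dim)
        # actions: (batch, frames, action_dim) -- one action per frame
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)
        # Broadcast each frame's modulation over its spatial tokens.
        return self.norm(x) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

# Shape check: 2 clips, 4 frames, 16 tokens/frame, 64-dim hidden, 7-DoF action.
cond = FrameLevelActionConditioning(hidden_dim=64, action_dim=7)
out = cond(torch.randn(2, 4, 16, 64), torch.randn(2, 4, 7))
assert out.shape == (2, 4, 16, 64)
```

Under this reading, the key design choice is that conditioning happens at frame granularity rather than on a pooled trajectory embedding, which is what the abstract credits with strengthening action-frame alignment.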
