Poster
MotionCtrl: A Real-time Controllable Vision-Language-Motion Model
Bin Cao · Sipeng Zheng · Ye Wang · Lujie Xia · Qianshan Wei · Qin Jin · Jing Liu · Zongqing Lu
Human motion generation holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time-efficiency analysis to guide the development of practical motion generators. We believe the release of HuMo100M and MotionCtrl will significantly advance the motion community toward real-life applications. Code and data will be available at https://anonymous.4open.science/r/MotionCtrl.
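To make the idea of part-aware residual quantization concrete, the sketch below shows a generic residual vector quantizer applied separately to each body part's feature: at every level, the nearest code is chosen for the current residual, and the leftover residual is passed to the next level, so each part yields its own token sequence. This is only a minimal illustration of the general technique under assumed settings; the part split (left_arm, right_arm, left_leg, right_leg, torso), codebook sizes, number of levels, and the encoder producing the features are hypothetical and not specified in the abstract.

```python
# Minimal sketch of part-aware residual quantization for motion tokenization.
# All hyperparameters and the body-part partition below are assumptions for
# illustration, not the paper's actual tokenizer configuration.
import numpy as np

rng = np.random.default_rng(0)

PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]  # assumed split
DIM, CODEBOOK_SIZE, NUM_LEVELS = 16, 64, 3  # hypothetical sizes

# One stack of residual codebooks per body part.
codebooks = {
    part: rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM)) for part in PARTS
}

def quantize_part(feature, books):
    """Residually quantize one part's feature: at each level, pick the nearest
    code for the current residual, subtract it, and record the token index."""
    residual = feature.copy()
    tokens = []
    for level_codes in books:                      # (CODEBOOK_SIZE, DIM) per level
        dists = np.linalg.norm(level_codes - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - level_codes[idx]     # pass the residual to the next level
    return tokens

def tokenize_frame(part_features):
    """Map per-part motion features to per-part token sequences, so each body
    part can later be generated or edited independently."""
    return {part: quantize_part(feat, codebooks[part]) for part, feat in part_features.items()}

# Stand-in for per-part features from a motion encoder.
frame = {part: rng.normal(size=DIM) for part in PARTS}
print(tokenize_frame(frame))
```

Keeping a separate codebook stack per part is what gives the tokenization its part-level granularity: a command that only concerns, say, the left arm can change that part's tokens while the others stay fixed.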