

Poster

RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration

Longxin Kou · Fei Ni · Jianye Hao · Peilong Han · Jinyi Liu · Haiqin Cui · Rui Liu · Yan Zheng


Abstract:

Recent advances in robotics have produced numerous valuable large-scale demonstration datasets, yet their potential remains underutilized due to annotation limitations. Current datasets often suffer from sparse temporal annotations and inconsistent labeling granularity, particularly for complex long-horizon demonstrations. Traditional manual annotation is expensive and scales poorly, while existing automated methods struggle to maintain temporal coherence and semantic richness across extended demonstrations. To address this, we propose RoboAnnotatorX, a reliable annotation tool that enhances a multimodal large language model to generate high-quality, context-rich annotations for complex long-horizon demonstrations. Specifically, we introduce a multi-scale, token-efficient encoder that maintains computational efficiency while capturing fine-grained visual details and preserving temporal information by jointly integrating scene-level anchoring, clip-level temporal dynamics, and video-level global modeling. We further construct a comprehensive dataset, RoboX-VQA, that synthesizes diverse QA pairs from both real-world and simulated data, bridging the significant domain gap in robotics demonstrations. Moreover, we adopt a curriculum-inspired three-stage training scheme that progressively develops capabilities from basic visual perception to sophisticated temporal reasoning. Extensive experiments demonstrate that RoboAnnotatorX significantly outperforms existing approaches in annotation quality and generalizes strongly across diverse robotic environments, helping unlock the full potential of existing robotic datasets.
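
The abstract does not include implementation details, so the following is a minimal, hypothetical PyTorch sketch of how such a multi-scale token-efficient encoder could be wired up: sparse scene-level anchor frames kept at full detail, clip-level attention pooled to one token per clip, and a single learned query producing a video-level global token. All module names, dimensions, anchor selection, and pooling choices below are our assumptions, not the authors' design.

    # Hedged sketch (not the authors' code): one plausible reading of the
    # multi-scale token-efficient encoder described in the abstract.
    import torch
    import torch.nn as nn

    class MultiScaleTokenEncoder(nn.Module):
        def __init__(self, dim: int = 768, clip_len: int = 8):
            super().__init__()
            self.clip_len = clip_len
            # Clip-level temporal dynamics: lightweight attention within each clip.
            self.clip_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            # Video-level global modeling: one learned query summarizes all frames.
            self.global_query = nn.Parameter(torch.randn(1, 1, dim))
            self.global_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
            # frame_tokens: (batch, num_frames, dim) per-frame features from a
            # visual backbone (e.g., one pooled token per frame).
            b, t, d = frame_tokens.shape
            assert t % self.clip_len == 0, "assume frame count divisible by clip_len for brevity"
            n_clips = t // self.clip_len

            # Scene-level anchoring: keep one full-detail frame per clip.
            anchor_idx = torch.arange(0, t, self.clip_len)
            scene_tokens = frame_tokens[:, anchor_idx]              # (b, n_clips, d)

            # Clip-level dynamics: attend within each clip, then mean-pool
            # each clip to a single token to stay token-efficient.
            clips = frame_tokens.reshape(b * n_clips, self.clip_len, d)
            clip_ctx, _ = self.clip_attn(clips, clips, clips)
            clip_tokens = clip_ctx.mean(dim=1).reshape(b, n_clips, d)

            # Video-level global modeling: one query attends over all frames.
            q = self.global_query.expand(b, -1, -1)
            video_token, _ = self.global_attn(q, frame_tokens, frame_tokens)

            # Joint integration: concatenated multi-scale tokens feed the LLM.
            return torch.cat([scene_tokens, clip_tokens, video_token], dim=1)

    # Usage with dummy per-frame features: 32 frames -> 4 + 4 + 1 = 9 tokens.
    enc = MultiScaleTokenEncoder(dim=768, clip_len=8)
    tokens = enc(torch.randn(2, 32, 768))  # -> (2, 9, 768)

The intent this sketch illustrates is the trade-off named in the abstract: anchor frames retain fine-grained visual detail, pooled clip tokens carry local temporal dynamics at low token cost, and the single global token preserves video-level context for the language model.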
