Poster
Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory
Daixun Li · Yusi Zhang · Mingxiang Cao · Donglai Liu · Weiying Xie · Tianlin Hui · Lunkai Lin · Zhiqiang Xie · Yunsong Li
Abstract:
Vision-Language-Action (VLA) models are crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we propose $\textbf{MindExplore}$, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution, so that task-oriented actions generalize across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed to plan long-horizon task sequences and provide meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, guided by these signals and multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. The acting layer also integrates a lightweight Multimodal Diffusion Policy (MMDP) that enhances spatial perception by fusing features from multiple visual modalities. In addition, a pioneering memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create $\textbf{SandGo-1k}$ and $\textbf{SandThink-21k}$, the first expert-level multimodal CoT dataset and embodied dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01$\times$ more successful than existing methods in unstructured and dynamic environments.
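The abstract describes the system only at a high level; as a rough illustration of the reasoning-acting-memory loop it outlines, here is a minimal sketch. All class and function names (ReasoningLayer, ActingLayer, Memory, run, etc.) are hypothetical placeholders introduced for this sketch, not the authors' implementation, and the skill experts are stand-ins for learned policies such as the MMDP heads.

```python
# Hypothetical sketch of a hierarchical reasoning-acting-memory loop.
# All names are illustrative placeholders, not the authors' released code.
from dataclasses import dataclass, field
import random


@dataclass
class Memory:
    """Feedback channel between the acting and reasoning layers."""
    events: list = field(default_factory=list)

    def record(self, subtask: str, success: bool) -> None:
        self.events.append((subtask, success))

    def needs_replan(self) -> bool:
        # Trigger replanning if the most recent subtask failed.
        return bool(self.events) and not self.events[-1][1]


class ReasoningLayer:
    """Plans a long-horizon task into subtasks and emits meta-action signals."""

    def plan(self, instruction: str, memory: Memory) -> list:
        # A real system would use task-specific chain-of-thought prompting here;
        # this stub just skips subtasks that the memory marks as completed.
        base = ["locate_target", "traverse_sand", "manipulate_object"]
        done = {s for s, ok in memory.events if ok}
        return [s for s in base if s not in done]


class ActingLayer:
    """Selects a skill expert per meta-action and runs closed-loop control."""

    def __init__(self):
        # Each "expert" stands in for a learned skill policy (e.g. an MMDP head).
        self.experts = {
            "locate_target": lambda obs: "scan",
            "traverse_sand": lambda obs: "drive",
            "manipulate_object": lambda obs: "grasp",
        }

    def execute(self, subtask: str, obs: dict) -> bool:
        action = self.experts[subtask](obs)
        print(f"subtask={subtask:20s} action={action}")
        # Stand-in for real closed-loop execution and success detection.
        return random.random() > 0.2


def run(instruction: str, max_rounds: int = 5) -> None:
    reasoning, acting, memory = ReasoningLayer(), ActingLayer(), Memory()
    for _ in range(max_rounds):
        plan = reasoning.plan(instruction, memory)
        if not plan:
            print("task complete")
            return
        for subtask in plan:
            ok = acting.execute(subtask, obs={"rgb": None, "depth": None})
            memory.record(subtask, ok)
            if memory.needs_replan():
                break  # hand control back to the reasoning layer for replanning


if __name__ == "__main__":
    run("collect the buried sample from the dune")
```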