

Poster

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

Jianzhe Gao · Rui Liu · Wenguan Wang


Abstract:

The vision-language navigation (VLN) task requires an agent to traverse complex 3D environments based on natural language instructions, necessitating thorough scene understanding. While existing works provide agents with various scene maps to enhance their spatial awareness, integrating 3D geometric priors and semantics into a unified map remains challenging. Moreover, these methods often neglect the complex spatial relationships and the open nature of VLN scenarios in their map design, which limits their ability to generalize to diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Gaussian Map is constructed online by initializing 3D Gaussians from sparse pseudo-LiDAR point clouds, providing informative geometric priors that boost spatial awareness. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians according to their membership in object instances or stuff categories within the open world. These processes yield a unified 3D Gaussian Map that integrates geometric priors with open-set semantics. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist the agent in VLN decision-making. Experiments on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our approach. The code will be released.
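To make the map construction described above concrete, the sketch below illustrates the general idea in Python: one Gaussian primitive is initialized per sparse pseudo-LiDAR point, and each primitive is then tagged with an open-set semantic group index (object instance or stuff category). This is a minimal illustration under assumed names and data shapes (GaussianPrimitive, init_gaussians, group_gaussians, and the random stand-in inputs are all hypothetical); the authors' code has not yet been released, so this should not be read as their implementation.

# Hypothetical sketch, not the authors' released code.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianPrimitive:
    mean: np.ndarray        # 3D center, initialized from one point-cloud point
    covariance: np.ndarray  # 3x3 covariance; starts isotropic, refined during training
    opacity: float          # differentiable opacity term
    group_id: int = -1      # open-set semantic group (object instance or stuff class)

def init_gaussians(points: np.ndarray, init_scale: float = 0.05) -> list[GaussianPrimitive]:
    """Initialize one isotropic Gaussian per sparse pseudo-LiDAR point."""
    cov = (init_scale ** 2) * np.eye(3)
    return [GaussianPrimitive(mean=p.copy(), covariance=cov.copy(), opacity=0.5)
            for p in points]

def group_gaussians(gaussians: list[GaussianPrimitive], labels: np.ndarray) -> None:
    """Assign each Gaussian an open-set group index.

    `labels` is assumed to come from an open-vocabulary segmenter whose
    predictions have been projected onto the point cloud (one label per point).
    """
    for g, lbl in zip(gaussians, labels):
        g.group_id = int(lbl)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-1.0, 1.0, size=(100, 3))  # stand-in for pseudo-LiDAR points
    labels = rng.integers(0, 5, size=100)        # stand-in open-set group indices
    gmap = init_gaussians(pts)
    group_gaussians(gmap, labels)
    print(len(gmap), gmap[0].group_id)

In the paper's setting, the resulting grouped Gaussian map would then feed the Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities; that decision-making stage is not sketched here.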
