Poster
Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories
Yicong Li · Yiyang Chen · Zhenyuan Ma · Junbin Xiao · Xiang Wang · Angela Yao
Language-guided Affordance Segmentation (LASO) aims to identify actionable object regions from text instructions. Central to its practicality is learning generalizable affordance knowledge that captures functional regions across diverse objects. However, current LASO solutions struggle to extend learned affordances to object categories not encountered during training. Scrutinizing these designs, we trace their limited generalization to unseen categories to two causes: (1) generalizable patterns in the intermediate layers of the 3D and text backbones are underutilized, impeding the formation of robust affordance knowledge; and (2) lacking structural knowledge of the target region, the models cannot handle the substantial variability of affordance regions across object categories. To address this, we introduce a GeneraLized frAmework on uNseen CategoriEs (GLANCE) with two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. Extensive experiments on two benchmark datasets demonstrate that GLANCE outperforms state-of-the-art methods, with notable improvements in generalization to unseen categories. Our code is available at https://anonymous.4open.science/r/GLANCE.
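The cross-modal connector described above can be pictured as cross-attention from pointwise 3D embeddings to intermediate text-token features. The following is a minimal dependency-free sketch of that idea, not the authors' implementation; the single attention head, the scaling, and the residual fusion are illustrative assumptions.

```python
# Sketch (assumed, not GLANCE's actual code): each point embedding attends
# over the instruction's intermediate token features and is updated
# residually, enriching it with affordance concepts from the text.
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def connect(point_feats, text_feats):
    """Fuse intermediate text features into each pointwise embedding.

    point_feats: N point embeddings (queries), each a list of d floats.
    text_feats:  M text-token embeddings (keys/values), same width d.
    Returns residual-updated point embeddings of the same shape.
    """
    d = len(point_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for p in point_feats:
        # Attention weights of this point over all text tokens.
        weights = softmax([dot(p, t) * scale for t in text_feats])
        # Weighted sum of token features, added residually to the point.
        ctx = [sum(w * t[i] for w, t in zip(weights, text_feats))
               for i in range(d)]
        fused.append([pi + ci for pi, ci in zip(p, ctx)])
    return fused

# Toy usage: 3 points and 2 text tokens with width-4 embeddings.
points = [[0.1, 0.2, 0.0, 0.5], [1.0, 0.0, 0.3, 0.1], [0.2, 0.2, 0.2, 0.2]]
tokens = [[0.5, 0.1, 0.0, 0.3], [0.0, 0.4, 0.2, 0.1]]
out = connect(points, tokens)
```

In GLANCE this fusion happens at intermediate backbone stages rather than only at the final layer, which is what lets generalizable mid-level patterns reach the pointwise embeddings.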