

Poster

CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation

Jiannan Ge · Lingxi Xie · Hongtao Xie · Pandeng Li · Sun-Ao Liu · Xiaopeng Zhang · Qi Tian · Yongdong Zhang


Abstract:

In recent years, Open-Vocabulary Semantic Segmentation (OVSS) has advanced considerably. However, existing methods mostly rely on a pre-trained vision-language model (e.g., CLIP) and require a predefined set of classes to guide the semantic segmentation process during inference. This not only narrows the range of application scenarios but also constrains comprehension to a finite vocabulary. To overcome this, we reformulate OVSS as a text generation task and propose the CLIP-adapted Region-to-Text Network (CRTNet), which achieves vocabulary-free OVSS by generating category names and descriptions for segmentation masks. The training process consists of two steps to ensure an accurate and detailed interpretation of the masked regions: (i) the initial step adapts CLIP visual features into mask-level proposal features using binarized masks extracted by a trained mask extractor, and (ii) the subsequent step aggregates these features into text-aware representations by integrating CLIP text embeddings, effectively aligning visual data with corresponding linguistic data to facilitate region-to-text learning. Furthermore, we introduce a series of parsing and filtering techniques to integrate multiple sources of training data, improving the generalization ability of our model. Experiments demonstrate that our model not only excels in OVSS but also exhibits scalability and can be adapted to various foundation models (e.g., SAM) without retraining.
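To make the two-step idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation: step (i) is illustrated as average-pooling of dense CLIP visual features inside each binarized proposal mask, and step (ii) as a cross-attention layer in which mask features attend to CLIP text embeddings. All names (mask_pool, TextAwareAggregator), tensor shapes, and the choice of cross-attention are assumptions for illustration only.

import torch
import torch.nn as nn


def mask_pool(clip_features: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average dense CLIP features inside each binarized mask.

    clip_features: (B, C, H, W) dense features from a CLIP image encoder.
    masks:         (B, N, H, W) binary proposal masks from a mask extractor.
    returns:       (B, N, C) mask-level proposal features.
    """
    masks = masks.float()
    # Sum features over the spatial positions covered by each mask ...
    pooled = torch.einsum("bnhw,bchw->bnc", masks, clip_features)
    # ... then normalize by mask area (clamped to avoid division by zero).
    area = masks.sum(dim=(-2, -1)).clamp(min=1.0).unsqueeze(-1)  # (B, N, 1)
    return pooled / area


class TextAwareAggregator(nn.Module):
    """Hypothetical stand-in for the text-aware aggregation step:
    mask-level features attend to CLIP text embeddings via cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mask_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # mask_feats: (B, N, C) queries; text_embeds: (B, T, C) keys/values.
        attended, _ = self.attn(query=mask_feats, key=text_embeds, value=text_embeds)
        return self.norm(mask_feats + attended)  # residual connection + layer norm


if __name__ == "__main__":
    B, C, H, W, N, T = 2, 512, 24, 24, 5, 16
    clip_feats = torch.randn(B, C, H, W)       # dense CLIP visual features
    masks = torch.rand(B, N, H, W) > 0.5       # binarized proposal masks
    text_embeds = torch.randn(B, T, C)         # CLIP text embeddings

    proposal_feats = mask_pool(clip_feats, masks)                     # (B, N, C)
    text_aware = TextAwareAggregator(dim=C)(proposal_feats, text_embeds)
    print(proposal_feats.shape, text_aware.shape)

In the paper's formulation, features like text_aware would then condition a text decoder that generates category names and descriptions per mask; that generative stage is omitted here.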
