Poster
Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation
Zheyun Qin · Deng Yu · Chuanchen Luo · Zhumin Chen
Open-vocabulary video instance segmentation (Open-Vocabulary VIS) aims to identify, track, and segment any instance within an open set of categories. Its core challenge is cross-domain alignment, which covers both spatial-temporal and text-visual alignment. Existing methods have made progress but still fall short on these alignments, largely because of data heterogeneity. Inspired by metric learning, we propose a Sliced Wasserstein Bridging Learning Framework that uses the Sliced Wasserstein distance as its core metric-learning tool to bridge the four domains involved in the task (spatial, temporal, textual, and visual). Our contributions are threefold: (1) Domain Alignment: By mapping features from different domains into a unified metric space, our method maintains temporal consistency and learns features that are intrinsically consistent across modalities, improving the fusion of text and visual information. (2) Weighting Mechanism: We introduce an importance weighting mechanism that improves the discriminative ability of our method on imbalanced or significantly different data. (3) High Efficiency: Our method inherits the computational efficiency of the Sliced Wasserstein distance, allowing online processing of large-scale video data while maintaining segmentation accuracy. Extensive experimental evaluations validate the robustness of our concept and the effectiveness of our framework.
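For readers unfamiliar with the Sliced Wasserstein distance, the following is a minimal sketch of the standard Monte Carlo estimator, not the authors' framework. It assumes two equally sized feature sets with uniform weights (so it omits the importance weighting mechanism described in the abstract), and the function name `sliced_wasserstein` and its parameters are hypothetical, chosen for illustration only.

```python
import math
import random


def sliced_wasserstein(x, y, n_projections=64, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x and y are lists of equal length whose items are feature vectors
    (lists of floats) of the same dimension. Uniform weights are assumed.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized sets")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both sets onto that direction and sort the 1-D values.
        xp = sorted(sum(a * t for a, t in zip(row, theta)) for row in x)
        yp = sorted(sum(a * t for a, t in zip(row, theta)) for row in y)
        # In one dimension, optimal transport pairs the sorted samples.
        total += sum(abs(a - b) ** p for a, b in zip(xp, yp)) / len(xp)
    # Average over projections, then take the p-th root.
    return (total / n_projections) ** (1.0 / p)
```

Each projection costs only a sort, which is the source of the efficiency the abstract refers to: the cost grows roughly as O(L n log n) for L projections and n samples, instead of solving a full optimal transport problem in the original feature space.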