O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
Abstract
Understanding the world from multiple perspectives is essential for intelligent systems that operate together, yet segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention module that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy that encourages the model to better differentiate between nearby objects. O-MaMa achieves state-of-the-art results on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +31% and +94% in Ego2Exo and Exo2Ego IoU over the official challenge baselines, and +13% and +6% over the previous SOTA while using only 1% of its training parameters.
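To make the mask-matching formulation concrete, the sketch below illustrates the general idea under stated assumptions; it is not the authors' implementation. It assumes pre-extracted dense DINOv2 feature maps and FastSAM mask candidates, and the names pool_mask_features and mask_matching_loss are hypothetical helpers: per-mask descriptors are obtained by average-pooling dense features inside each candidate mask, and matching is trained with an InfoNCE-style contrastive objective between ego query descriptors and exo candidate descriptors.

```python
import torch
import torch.nn.functional as F

def pool_mask_features(feat_map, masks):
    """Average-pool dense per-patch features inside each candidate mask.

    feat_map: (C, H, W) dense semantic features (e.g., from DINOv2).
    masks:    (N, H, W) binary masks for N candidates (e.g., from FastSAM).
    Returns:  (N, C) one object-level descriptor per mask.
    """
    masks = masks.float()
    # Sum features under each mask, then normalize by mask area.
    pooled = torch.einsum("nhw,chw->nc", masks, feat_map)
    areas = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(1)
    return pooled / areas

def mask_matching_loss(ego_desc, exo_desc, pos_idx, temperature=0.07):
    """Contrastive mask-matching loss (InfoNCE-style): each ego query mask
    should match its ground-truth exo candidate among all exo candidates.

    ego_desc: (Q, C) query-mask descriptors from the ego view.
    exo_desc: (N, C) candidate-mask descriptors from the exo view.
    pos_idx:  (Q,) index of the positive exo candidate for each query.
    """
    ego_desc = F.normalize(ego_desc, dim=-1)
    exo_desc = F.normalize(exo_desc, dim=-1)
    logits = ego_desc @ exo_desc.t() / temperature  # (Q, N) similarities
    return F.cross_entropy(logits, pos_idx)
```

In this sketch, hard negatives (e.g., masks adjacent to the positive candidate) would simply appear as additional rows of exo_desc, making the cross-entropy over the similarity matrix penalize confusions between nearby objects.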