

Poster

Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding

Shuyi Ouyang · Ziwei Niu · Hongyi Wang · Yen-wei Chen · Lanfen Lin


Abstract:

Referring Visual Grounding (RVG) tasks revolve around utilizing vision-language interactions to incorporate object information from language expressions, thereby enabling targeted object detection or segmentation within images. Transformer-based methods have enabled effective interaction through attention mechanisms, achieving notable performance in RVG tasks. However, existing strategies for RVG, which involve direct interaction between visual and linguistic features, face three key challenges: (i) a tendency to focus on a single target, (ii) insufficient control over linguistic noise, and (iii) high computational cost. To address these challenges, we propose a Region-aware Anchoring Mechanism (RaAM) that mediates vision-language interactions. In RaAM, region-aware anchors engage in alternating interactions with the vision and language modalities, acting as indicators for object presence across different regions within the image. RaAM (i) directs attention to multiple target regions for better localization, (ii) reduces cross-modal redundancy by using anchors as buffers, and (iii) lowers time complexity. In addition, we design region- and pixel-level loss functions to enhance object presence assessment and edge precision. We evaluate our RaAM-RVG on four benchmark datasets and integrate RaAM into various models by replacing their interaction design. Results show that RaAM outperforms state-of-the-art methods with lower computational cost. Code will be released publicly.
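To make the anchoring idea concrete, the sketch below illustrates one plausible reading of the mechanism: a small set of learnable region anchors alternately cross-attends to visual and linguistic features, and the visual features then query the anchors rather than the full expression, so the interaction cost scales with the (small) anchor count instead of the product of the two sequence lengths. This is a minimal illustrative sketch only; the module and parameter names, the anchor count, and the exact attention layout are assumptions, not the released RaAM implementation.

```python
import torch
import torch.nn as nn


class AnchorMediatedInteraction(nn.Module):
    """Illustrative sketch of anchor-mediated vision-language interaction.

    Learnable region anchors act as a buffer between modalities instead of
    direct vision-language attention. All names and sizes are assumptions.
    """

    def __init__(self, dim=256, num_anchors=64, num_heads=8):
        super().__init__()
        # One anchor per coarse image region (assumed layout, e.g. an 8x8 grid).
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim))
        self.attn_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_back = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens:  (B, N_v, dim) flattened image features
        # lang_tokens: (B, N_l, dim) expression token features
        B = vis_tokens.size(0)
        anchors = self.anchors.unsqueeze(0).expand(B, -1, -1)

        # Anchors gather visual evidence from the regions they represent.
        anchors, _ = self.attn_vis(anchors, vis_tokens, vis_tokens)
        # Anchors then absorb linguistic cues, filtering expression noise
        # before it ever touches the pixel-level features.
        anchors, _ = self.attn_lang(anchors, lang_tokens, lang_tokens)
        # Visual features query the anchors instead of the full expression,
        # so cost is O(N_a * (N_v + N_l)) rather than O(N_v * N_l).
        vis_tokens, _ = self.attn_back(vis_tokens, anchors, anchors)
        return vis_tokens, anchors


# Usage sketch with hypothetical feature shapes.
if __name__ == "__main__":
    block = AnchorMediatedInteraction()
    vis = torch.randn(2, 400, 256)   # e.g. 20x20 feature map, flattened
    lang = torch.randn(2, 20, 256)   # e.g. 20 expression tokens
    fused_vis, anchors = block(vis, lang)
    print(fused_vis.shape, anchors.shape)  # torch.Size([2, 400, 256]) torch.Size([2, 64, 256])
```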
