Poster
Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
Guangyu Ren · Hengyan Liu · Michalis Lazarou · Tania Stathaki
Camouflaged scenes, where objects blend seamlessly into their environments, pose significant challenges to both human observers and computer vision systems. These objects match the background in color, texture, and shape, making them difficult to detect. To this end, we propose leveraging the Segment Anything Model (SAM) to tackle this challenging task effectively. Specifically, we show how to exploit SAM without requiring any manual prompts through several ideas. At the core of our method lies the rich information extracted through multi-modal prompts. First, we generate an image caption with the BLIP model and obtain its text embedding through a text encoder. We then generate a visual embedding with the vision encoder of the BLIP model, and we feed both into SAM to provide additional semantic information about the image. Finally, we propose two architectural novelties: a) we effectively integrate the multi-modal information into SAM through a multi-level adapter, and b) we replace the dense embedding of SAM with the image embedding of its image encoder. Our method achieves new state-of-the-art performance on 11 out of 12 metrics across three benchmark datasets for camouflaged object detection. Additionally, our method can be adapted to other tasks such as medical image segmentation, performing on par with or even outperforming state-of-the-art methods. Our code is available in the supplementary material.
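The prompt-free pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the HuggingFace BLIP captioning model, the CLIP text encoder, the checkpoint paths, and the two linear projection layers standing in for the paper's multi-level adapter are all assumptions made for the sake of a runnable example.

```python
# Minimal sketch of a prompt-free, multi-modal SAM pipeline (assumptions noted inline).
import numpy as np
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPTextModel, CLIPTokenizer)
from segment_anything import sam_model_registry

image = Image.open("camouflaged_example.jpg").convert("RGB")  # placeholder path

# 1) Caption the image with BLIP and keep its vision-encoder features.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_inputs = blip_proc(image, return_tensors="pt")
with torch.no_grad():
    caption_ids = blip.generate(**blip_inputs)
    caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)
    vis_tokens = blip.vision_model(
        pixel_values=blip_inputs.pixel_values
    ).last_hidden_state                              # (1, N, 768) visual tokens

# 2) Embed the caption with a text encoder (CLIP here, as an assumption).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    text_emb = clip_text(**clip_tok([caption], return_tensors="pt")).pooler_output  # (1, 512)

# 3) Run SAM's image encoder; its output also replaces the dense prompt embedding.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint
img_1024 = torch.as_tensor(
    np.array(image.resize((1024, 1024))), dtype=torch.float32
).permute(2, 0, 1)                                   # (3, 1024, 1024), values in [0, 255]
with torch.no_grad():
    image_embedding = sam.image_encoder(sam.preprocess(img_1024)[None])  # (1, 256, 64, 64)

# 4) Project the multi-modal features to SAM's prompt width and decode a mask.
#    These two linear layers are hypothetical stand-ins for the paper's multi-level adapter.
prompt_dim = 256
text_proj = torch.nn.Linear(text_emb.shape[-1], prompt_dim)
vis_proj = torch.nn.Linear(vis_tokens.shape[-1], prompt_dim)
sparse_prompts = torch.cat(
    [text_proj(text_emb)[:, None, :], vis_proj(vis_tokens.mean(dim=1))[:, None, :]],
    dim=1,
)                                                    # (1, 2, 256) sparse prompt tokens
with torch.no_grad():
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_prompts,
        dense_prompt_embeddings=image_embedding,     # image embedding in place of SAM's dense prompt
        multimask_output=False,
    )
print(caption, low_res_masks.shape)                  # low-res mask logits, e.g. (1, 1, 256, 256)
```

In this sketch the projection layers are untrained, so the output is only structurally meaningful; in the method itself the adapter and decoder would be trained on camouflaged-object data so that the caption and visual embeddings act as automatically generated prompts.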