

Poster

When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection

Hongliang Hongliang · Yongxiang Liu · Canyu Mo · Weijie Li · Bowen Peng · Li Liu


Abstract:

Few-shot object detection aims to detect novel classes from limited samples. Owing to boundary and scale discrepancies with the base classes, novel classes perform suboptimally under limited samples. Although recent methods leverage the rich semantic representations of pretrained ViTs to overcome the limitations of model fine-tuning and thereby improve novel-class performance, designing a ViT architecture that addresses the boundary and scale issues so as to balance base- and novel-class performance remains challenging: (1) modeling feature distinctions at object boundaries at the pixel level while preserving global information; and (2) applying scale-specific extraction to images containing multiscale objects, adaptively capturing both local details and global contours. To this end, the Pixel Difference Vision Transformer (PiDiViT) is proposed. Its innovations are: (1) a difference convolution fusion module (DCFM), which achieves precise object-boundary localization and effective preservation of global object information by integrating direction-sensitive differential feature maps of pixel neighborhoods with the original feature maps; and (2) a multiscale feature fusion module (MFFM), which adaptively fuses features extracted by convolutional kernels of five different scales, using a scale attention mechanism to generate the fusion weights, achieving an optimal balance between the extraction of local detail and global semantic information. PiDiViT achieves state-of-the-art results on the COCO benchmark: it surpasses the few-shot detection SOTA by 2.7 nAP50 (10-shot) and 4.0 nAP50 (30-shot) on novel classes, exceeds the one-shot detection SOTA by 4.4 nAP50, and exceeds the open-vocabulary detection SOTA by 3.7 nAP50. The code will be made publicly available.
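The core operation behind DCFM, a pixel-difference (central-difference) convolution, can be illustrated with a minimal single-channel NumPy sketch. This is not the authors' implementation: the function name, the `theta` blending factor, and the restriction to one channel are assumptions made for illustration; the sketch only shows how a convolution over pixel differences reduces to a standard convolution minus a center-weighted term.

```python
import numpy as np

def central_difference_conv(x, w, theta=0.7):
    """Single-channel pixel-difference convolution (illustrative sketch).

    Computes y[i,j] = sum_k w_k * (x_k - x[i,j]) blended with a vanilla
    convolution, which algebraically equals
        conv(x, w) - theta * x[i,j] * sum(w).
    Differential responses are direction-sensitive because each neighbor
    difference keeps its own weight w_k.
    """
    H, W = x.shape
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)                      # zero-pad to keep output size
    out = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]
            # vanilla response minus center-weighted term = difference response
            out[i, j] = (w * patch).sum() - theta * x[i, j] * w.sum()
    return out
```

With `theta=1.0` the response vanishes on constant regions (no pixel differences) and grows at boundaries, which is the boundary-localization behavior the abstract attributes to DCFM.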
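The scale-attention fusion in MFFM can likewise be sketched in a few lines. The following NumPy toy uses mean filters of five kernel sizes as stand-in branches and derives softmax attention weights from each branch's global average; the kernel sizes, the mean-filter branches, and the pooling-based logits are all assumptions for illustration, not the paper's design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mffm_fuse(x, kernel_sizes=(1, 3, 5, 7, 9)):
    """Toy multiscale fusion: five mean-filter branches + softmax scale attention.

    Small kernels keep local detail; large kernels capture global contours.
    A softmax over per-branch logits weights the branches before summation.
    """
    H, W = x.shape
    branches = []
    for k in kernel_sizes:
        p = k // 2
        xp = np.pad(x, p, mode='edge')     # edge-pad so means stay in range
        out = np.empty((H, W), dtype=float)
        for i in range(H):
            for j in range(W):
                out[i, j] = xp[i:i + k, j:j + k].mean()
        branches.append(out)
    # hypothetical attention logits: global average pooling of each branch
    logits = np.array([b.mean() for b in branches])
    weights = softmax(logits)              # sums to 1 across scales
    return sum(wi * bi for wi, bi in zip(weights, branches))
```

With a single 1x1 branch the fusion is the identity, and adding larger kernels blends in progressively smoother, more global views of the input.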
