Poster
Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention
Shiwei Zhang · Qi Zhou · Wei Ke
Text-guided zero-shot object counting leverages vision-language models (VLMs) to count objects of an arbitrary class specified by a text prompt. Existing approaches to this challenging task use only local patch-level features to fuse with the text feature, ignoring the important influence of the global image-level feature. In this paper, we propose a universal strategy that exploits both local patch-level features and the global image-level feature simultaneously. Specifically, to improve the localization ability of VLMs, we propose Text-guided Local Ranking. Based on the prior knowledge that foreground patches have higher similarity with the text prompt, a new local-text rank loss is designed to widen the gap between the similarity scores of foreground and background patches, pushing the two apart. To enhance the counting ability of VLMs, Number-evoked Global Attention is introduced to first align the global image-level feature with multiple number-conditioned text prompts. The prompt with the highest similarity is then selected to compute cross-attention with the global image-level feature. Extensive experiments on widely used datasets and methods demonstrate the superior performance, generalization, and scalability of the proposed approach. Furthermore, to better evaluate text-guided zero-shot object counting methods, we introduce ZSC-8K, a larger and more challenging dataset, to establish a new benchmark. Code and the ZSC-8K dataset will be made available.
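The abstract does not give the exact form of the local-text rank loss, but a minimal sketch is possible under stated assumptions: patch features come from a CLIP-like image encoder, the text feature encodes the class prompt, and foreground/background masks are available from annotations. The function name, the pairwise hinge formulation, and the margin hyperparameter are all hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def local_text_rank_loss(patch_feats: torch.Tensor,
                         text_feat: torch.Tensor,
                         fg_mask: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Hypothetical sketch of a local-text rank loss.

    patch_feats: (N, D) patch-level features from a CLIP-like image encoder
    text_feat:   (D,)   feature of the class text prompt
    fg_mask:     (N,)   bool mask, True for foreground patches
    margin:      assumed hinge margin (hyperparameter, not from the paper)

    Encourages the text similarity of every foreground patch to exceed
    that of every background patch by at least `margin`, which pushes
    foreground and background patches apart as the abstract describes.
    """
    sims = F.cosine_similarity(patch_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
    fg_sims = sims[fg_mask]    # similarities of foreground patches
    bg_sims = sims[~fg_mask]   # similarities of background patches
    # Similarity gap over all foreground/background pairs, hinged at the margin.
    gap = fg_sims.unsqueeze(1) - bg_sims.unsqueeze(0)  # (|fg|, |bg|)
    return F.relu(margin - gap).mean()
```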
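Likewise, Number-evoked Global Attention can be sketched from the abstract's description: align the global image-level feature with several number-conditioned prompts (e.g. "a photo of k {class}" for different counts k), pick the best-matching one, and fuse it with the global feature via cross-attention. The module layout below, including which feature serves as the attention query, is an assumption for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NumberEvokedGlobalAttention(nn.Module):
    """Hypothetical sketch: select the number-conditioned text prompt most
    similar to the global image feature, then compute cross-attention
    between the selected text feature and the global image feature."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, global_feat: torch.Tensor,
                number_text_feats: torch.Tensor) -> torch.Tensor:
        # global_feat:       (B, D)    image-level feature (e.g. CLIP [CLS] token)
        # number_text_feats: (B, K, D) features of K number-conditioned prompts
        g = F.normalize(global_feat, dim=-1)
        t = F.normalize(number_text_feats, dim=-1)
        sims = torch.einsum('bd,bkd->bk', g, t)   # image-prompt similarities
        best = sims.argmax(dim=-1)                # index of the best count prompt
        sel = number_text_feats[torch.arange(best.numel()), best]  # (B, D)
        # Assumed direction: the selected number-evoked text feature queries
        # the global image feature; the abstract leaves this unspecified.
        out, _ = self.attn(query=sel.unsqueeze(1),
                           key=global_feat.unsqueeze(1),
                           value=global_feat.unsqueeze(1))
        return out.squeeze(1)                     # (B, D) fused global feature
```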