Poster
Think Twice: Test-Time Reasoning for Robust CLIP Zero-Shot Classification
Shenyu Lu · Zhaoying Pan · Xiaoqian Wang
Contrastive Language-Image Pre-training (CLIP) models exhibit intriguing properties, particularly in their zero-shot classification capability. However, the reliability of CLIP zero-shot classification is severely undermined by spurious correlations. Existing efforts to enhance the robustness of zero-shot CLIP models often rely on prior knowledge or annotations of spurious correlations, which limits real-world applicability because such information is frequently unavailable. Alternative methods attempt to detect distribution shifts at test time, but they require training statistics whose access is often restricted or computationally expensive to obtain. To address the challenges posed by spurious correlations in zero-shot settings, we propose a novel test-time reasoning approach. Our method, inspired by human recognition, localizes the object of interest and refines the classification accordingly. CLIP's inherent capacity for semantic understanding allows us to isolate the object without auxiliary models. Zero-shot classification is then performed exclusively on the localized object, effectively mitigating the influence of spurious correlations. The proposed approach is interpretable and flexible, as it requires no spurious-correlation annotations or prior knowledge, making it widely applicable. Substantial improvements across multiple benchmark datasets validate the effectiveness of our approach.
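The abstract describes a localize-then-classify pipeline at test time. The sketch below is only a minimal illustration of that general idea using the public OpenAI CLIP package; it is not the authors' actual procedure. The grid-crop localization, the foreground probe prompts, and the example class names are all assumptions made for the sake of a runnable example.

```python
# Hedged sketch: NOT the paper's method. Illustrates the generic
# "localize the object, then zero-shot classify only that region" idea
# with plain OpenAI CLIP. Grid crops, probe prompts, and class names
# are placeholder assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["landbird", "waterbird"]                    # example classes (assumption)
class_prompts = [f"a photo of a {c}" for c in class_names]
object_prompts = ["a photo of a bird", "a photo of the background"]  # crude foreground probe

def grid_crops(img, n=3):
    """Yield the full image plus n x n overlapping half-size crops."""
    w, h = img.size
    yield img
    for i in range(n):
        for j in range(n):
            left, top = int(i * w / (n + 1)), int(j * h / (n + 1))
            yield img.crop((left, top, left + 2 * w // (n + 1), top + 2 * h // (n + 1)))

@torch.no_grad()
def classify(img_path):
    img = Image.open(img_path).convert("RGB")

    # Score each crop by how strongly it matches the foreground probe prompt.
    obj_feat = model.encode_text(clip.tokenize(object_prompts).to(device))
    obj_feat /= obj_feat.norm(dim=-1, keepdim=True)

    best_crop, best_score = None, -1.0
    for crop in grid_crops(img):
        f = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        f /= f.norm(dim=-1, keepdim=True)
        score = (100.0 * f @ obj_feat.T).softmax(dim=-1)[0, 0].item()
        if score > best_score:
            best_crop, best_score = crop, score

    # Zero-shot classification restricted to the localized region.
    cls_feat = model.encode_text(clip.tokenize(class_prompts).to(device))
    cls_feat /= cls_feat.norm(dim=-1, keepdim=True)
    f = model.encode_image(preprocess(best_crop).unsqueeze(0).to(device))
    f /= f.norm(dim=-1, keepdim=True)
    pred = (100.0 * f @ cls_feat.T).softmax(dim=-1).argmax(dim=-1).item()
    return class_names[pred]
```

The point of the toy pipeline is that the final class scores are computed only from the selected crop, so background pixels that carry a spurious cue (e.g., water behind a landbird) contribute less to the prediction than in whole-image zero-shot classification.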