Poster
Region-based Cluster Discrimination for Visual Representation Learning
Yin Xie · Kaicheng Yang · Xiang An · Kun Wu · Yongle Zhao · Weimo Deng · Zimin Ran · Yumeng Wang · Ziyong Feng · Roy Miles · Ismail Elezi · Jiankang Deng
The vision towers of Multimodal Large Language Models (MLLMs) have significantly improved the performance of large multimodal models. This success is primarily attributed to extensive language-alignment training, which fosters human-like understanding. However, these models predominantly rely on global category representations, limiting their performance on tasks that require localized representations, such as grounding, OCR, and segmentation. To address this limitation, we propose a novel Locality-Aware Cluster Contrastive Learning strategy. Our approach leverages local feature clustering and contrastive learning to improve the model's ability to understand and represent localized information. Furthermore, our method scales readily to billion-scale training, ensuring its applicability to large datasets and models. We demonstrate the effectiveness of our method by achieving state-of-the-art results on the Visual Question Answering (VQA) and RefCOCO benchmarks, showcasing its superior capability on tasks that require fine-grained visual understanding. Our results show a significant improvement in performance, validating the potential of our approach for advancing MLLM tasks, and our method outperforms the widely used SigLIP.
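To make the idea of region-level cluster discrimination concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes region (patch) embeddings, a set of learnable cluster prototypes, and pseudo cluster labels, and classifies each region against all prototypes with a temperature-scaled cosine-similarity cross-entropy. All names, shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def region_cluster_contrastive_loss(region_feats, cluster_centers,
                                     cluster_labels, temperature=0.07):
    """Hypothetical region-level cluster-discrimination loss.

    region_feats:    (N, D) region/patch embeddings from the vision tower
    cluster_centers: (K, D) learnable cluster prototypes
    cluster_labels:  (N,)   pseudo-label of the cluster each region belongs to
    """
    # Normalize features and prototypes so the logits are cosine similarities.
    region_feats = F.normalize(region_feats, dim=-1)
    cluster_centers = F.normalize(cluster_centers, dim=-1)

    # Each region is classified against every cluster prototype.
    logits = region_feats @ cluster_centers.t() / temperature  # (N, K)

    # Cross-entropy pulls each region toward its assigned cluster and
    # pushes it away from all other clusters (the discrimination step).
    return F.cross_entropy(logits, cluster_labels)


if __name__ == "__main__":
    # Toy example: 8 regions, 16-dim features, 4 clusters.
    feats = torch.randn(8, 16, requires_grad=True)
    centers = torch.randn(4, 16, requires_grad=True)
    labels = torch.randint(0, 4, (8,))
    loss = region_cluster_contrastive_loss(feats, centers, labels)
    loss.backward()
    print(float(loss))
```

In practice such a loss would be computed over region features from many images per batch, with prototypes updated jointly with the encoder; the sketch only shows the per-region discrimination objective under those assumptions.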