Poster
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text.
Bingchao Wang · Zhiwei Ning · Jianyu Ding · Xuanang Gao · Yin Li · Dongsheng Jiang · JIE YANG · Wei Liu
[
Abstract
]
Abstract:
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on under-stream tasks with long-text inputs ($>77$ tokens). To improve long-text understanding while preserving short-text capabilities, we propose Fix-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that Fix-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that Fix-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.
Live content is unavailable. Log in and register to view live content