Poster
Mixture-of-Scores: Robust Image-Text Data Quality Score via Three Lines of Code
WU Sitong · Haoru Tan · Yukang Chen · Shaofeng Zhang · Jingyao Li · Bei Yu · Xiaojuan Qi · Jiaya Jia
Evaluating the quality of image-text pairs plays a crucial role in data processing strategies for vision-language pre-training. Most popular metrics, such as CLIP-Score, rely on off-the-shelf vision-language models to score a paired image and text by their feature similarity. However, we observe a prevalent phenomenon: different scoring models yield different quality scores for the same data. This score disparity directly affects the outcome of data processing, so datasets processed with different quality scores differ from one another. The dataset disparity in turn leads to performance differences among models trained on datasets processed with different quality scores. Notably, no single quality score performs best across all evaluation tasks. Each score exhibits an inherent bias towards certain concepts or tasks, and different scores have complementary effects on model performance. This makes choosing a scoring model confusing. In this paper, we first investigate these disparity phenomena and analyze their causes. We then propose a simple yet effective method, named Mixture-of-Scores (MoS), which extracts the essence of existing quality scores while eliminating their biases by integrating them into a more robust score through a data-adaptive ensemble strategy. In particular, it can be implemented with only three lines of code. Extensive experiments demonstrate the superiority and robustness of MoS over any existing single quality score across a variety of vision-language tasks and benchmarks. We hope that our work provides novel perspectives and practical tools, freeing the community from the quandary of choosing a scoring model.
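The abstract does not spell out the MoS formula, so the snippet below is not the authors' method. It is a minimal sketch of the general idea of combining several quality scores into one, using a plain rank-average ensemble as a stand-in. The function names (`rank_normalize`, `mix_scores`) and the example scores are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_normalize(scores):
    """Map one scoring model's raw scores to ranks scaled into [0, 1].

    Ranking within the dataset puts scores from different models on a
    common scale, whatever their raw ranges are.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort()  # 0 = lowest score in the dataset
    return ranks / max(len(scores) - 1, 1)


def mix_scores(score_matrix):
    """Average rank-normalized scores across models.

    score_matrix has shape (n_models, n_samples).
    This is an illustrative ensemble, NOT the MoS formula from the paper.
    """
    normalized = np.stack([rank_normalize(row) for row in score_matrix])
    return normalized.mean(axis=0)


# Hypothetical scores from two scoring models for four image-text pairs.
scores_a = [0.31, 0.25, 0.40, 0.10]  # e.g. a cosine-similarity style score
scores_b = [22.0, 30.0, 35.0, 5.0]   # a second model on a different scale

mixed = mix_scores([scores_a, scores_b])
best = int(np.argmax(mixed))   # index of the pair both models rate highest
worst = int(np.argmin(mixed))  # index of the pair both models rate lowest
```

Here the two models disagree on the ordering of the first two pairs, and the averaged ranks place them level, while both models agree on the best and worst pair. A data-adaptive scheme such as the one the abstract describes would presumably weight the models differently per sample rather than uniformly.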