Poster
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao · Isaac Chung · Imene Kerboua · Jamie Stirling · Xin Zhang · Márton Kardos · Roman Solomatin · Noura Al Moubayed · Kenneth Enevoldsen · Niklas Muennighoff
Abstract:
Image representation learning and image-text alignment have advanced rapidly, becoming key components in multi-modal research. However, these advances are often evaluated through distinct, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear how the capabilities measured by linear probing translate to retrieval, and vice versa. We introduce the Massive Image Embedding Benchmark (MIEB), a comprehensive benchmark designed to evaluate the capabilities of image embeddings across the broadest spectrum of tasks to date. MIEB spans 8 task categories, covering 130 tasks and a total of 39 languages. By benchmarking the performance of 50 models, MIEB uncovers hidden capabilities of advanced vision models beyond semantic alignment, such as their accurate visual representation of text, but also reveals their still limited capabilities in robust compositionality and interleaved encoding. The benchmark aims to provide insights for guiding the design of universal image embeddings that encode multi-modal information. Additionally, we show that vision encoders' performance on MIEB tasks correlates strongly with MLLMs' performance on downstream tasks; for example, performance on Visual STS tasks shows over $99\%$ correlation with MLLMs' performance on OCRBench and TextVQA. Our findings underscore the importance of assessing vision embeddings beyond classification and retrieval tasks, highlighting their role in building multi-modal generative systems. MIEB comes with open-source code, datasets, and a leaderboard.
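Since the abstract notes that MIEB ships with open-source code, the following is a minimal sketch of what evaluating a model might look like via the open-source mteb library, which hosts the benchmark. The benchmark identifier, model checkpoint, and output path below are illustrative assumptions, not specifics taken from the abstract; consult the MIEB leaderboard and documentation for the exact names.

```python
# Minimal sketch: evaluating an image embedding model on MIEB with the
# open-source mteb library. The benchmark and model identifiers here are
# assumptions for illustration; check the MIEB docs for the exact names.
import mteb

# Fetch the MIEB task collection (benchmark name assumed).
benchmark = mteb.get_benchmark("MIEB(Multilingual)")

# Any image/image-text embedding model registered with mteb; this CLIP
# checkpoint is only an example.
model = mteb.get_model("openai/clip-vit-base-patch32")

# Run all tasks in the collection and write per-task scores to disk.
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/mieb")
```

The resulting per-task scores are what analyses like the reported Visual STS vs. OCRBench/TextVQA correlation would be computed over.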