

Poster

Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework

Rohan Sharma · Changyou Chen · Feng-Ju Chang · Seongjun Yun · Xiaohu Xie · Rui Meng · Dehong Xu · Alejandro Mottini · Qingjun Cui


Abstract:

We present the Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches such as VLM2VEC, MM-Embed, NV-Embed, and MM-GEM have demonstrated impressive capabilities in multi-modal and multi-task scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution is a task-aware contrastive learning framework with an automated Bayesian weighting mechanism, which provides a principled way to balance multiple tasks during training and departs from conventional contrastive learning strategies. We further enhance the framework with a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on the M-BEIR and ICinW benchmarks demonstrate M3T-UEM's effectiveness, showing competitive or superior performance compared to both traditional encoder-based methods and recent LLM-based approaches. Owing to the LLM backbone, the model is particularly strong on compositional conceptual changes and multilingual scenarios, where it drastically outperforms CLIP in zero-shot settings, often by orders of magnitude.
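To make the task-adaptive weighting idea concrete, below is a minimal sketch, not the authors' implementation: it assumes a per-task InfoNCE contrastive loss combined through learned per-task log-variances (a common uncertainty-style Bayesian weighting heuristic). The class and parameter names (`TaskWeightedContrastiveLoss`, `log_task_var`) and the exact weighting form are illustrative assumptions; the paper's actual mechanism may differ.

```python
# Minimal sketch of a task-aware contrastive objective with learned
# per-task weights. Assumptions: InfoNCE per task, uncertainty-style
# weighting via learned log-variances. Not the official M3T-UEM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskWeightedContrastiveLoss(nn.Module):
    """Per-task InfoNCE losses combined with learned log-variance weights
    (lower learned variance gives a task higher effective weight)."""

    def __init__(self, num_tasks: int, temperature: float = 0.07):
        super().__init__()
        # One learnable log-variance per task (hypothetical parameterization).
        self.log_task_var = nn.Parameter(torch.zeros(num_tasks))
        self.temperature = temperature

    def info_nce(self, q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        # q, d: (batch, dim) L2-normalized query / document embeddings;
        # the i-th query matches the i-th document (in-batch negatives).
        logits = q @ d.t() / self.temperature
        targets = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, targets)

    def forward(self, batches_per_task: dict) -> torch.Tensor:
        # batches_per_task: {task_id: (query_emb, doc_emb)}
        total = torch.zeros((), device=self.log_task_var.device)
        for task_id, (q, d) in batches_per_task.items():
            q = F.normalize(q, dim=-1)
            d = F.normalize(d, dim=-1)
            loss_t = self.info_nce(q, d)
            # Uncertainty weighting: exp(-log_var) * loss + 0.5 * log_var,
            # so the model trades off task weight against a regularizer.
            log_var = self.log_task_var[task_id]
            total = total + torch.exp(-log_var) * loss_t + 0.5 * log_var
        return total


if __name__ == "__main__":
    # Toy usage with two tasks and random 256-d embeddings.
    loss_fn = TaskWeightedContrastiveLoss(num_tasks=2)
    batch = {
        0: (torch.randn(8, 256), torch.randn(8, 256)),
        1: (torch.randn(8, 256), torch.randn(8, 256)),
    }
    loss = loss_fn(batch)
    loss.backward()
    print(float(loss))
```

In this sketch the per-task weights are optimized jointly with the embedding model, so tasks whose losses remain noisy are automatically down-weighted; this is one standard way to realize the "automated Bayesian weighting" described in the abstract.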
