Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Dohwan Ko · Ji Soo Lee · Minhyuk Choi · Zihang Meng · Hyunwoo Kim
Abstract
Text-Video Retrieval has been extensively studied to accurately retrieve the most relevant text (or video) candidate for a given video (or text) query from large-scale online databases. With the advancement of multi-modal large language models (MLLMs), recent studies have proposed MLLM-based retrieval systems to enhance retrieval performance, particularly for long and complex query-candidate pairs. However, we observe that the naive application of MLLMs, $\textit{i.e.}$, retrieval based on candidate likelihood alone, introduces $\textit{candidate prior bias}$: candidates with inherently higher prior probabilities are favored over those that are more relevant to the query. To address this, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM ($\textbf{BLiM}$), which leverages query likelihood in addition to candidate likelihood by training the model both to generate text from a given video and to generate video features from a given text. Furthermore, we introduce Candidate Prior Normalization ($\textbf{CPN}$), a simple yet effective training-free score-calibration module that mitigates candidate prior bias in the candidate likelihood. On four Text-Video Retrieval benchmarks, BLiM equipped with CPN outperforms previous state-of-the-art models by an average of 6.4 points in R@1, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN, which enhances visual understanding by reducing reliance on textual priors.
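To make the scoring concrete, below is a minimal sketch of how bidirectional likelihood estimation and prior calibration could be combined at inference time. The function names (`cpn_score`, `bidirectional_score`, `rank`), the prior-subtraction form of the calibration, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact implementation; the MLLM log-likelihood calls are stand-ins.

```python
# A minimal sketch of bidirectional likelihood scoring with a CPN-style
# calibration. Function names, the prior-subtraction form, and the
# alpha/beta hyperparameters are assumptions for illustration only.

def cpn_score(log_p_cand_given_query, log_p_cand_prior, alpha=1.0):
    """Calibrate the candidate likelihood by subtracting a scaled prior.

    A candidate that is likely regardless of the query should not be
    rewarded; removing its prior keeps only query-specific relevance
    (a pointwise-mutual-information-style correction).
    """
    return log_p_cand_given_query - alpha * log_p_cand_prior


def bidirectional_score(log_p_cand_given_query, log_p_cand_prior,
                        log_p_query_given_cand, alpha=1.0, beta=0.5):
    """Mix the calibrated candidate likelihood with the query likelihood.

    Query likelihood P(query | candidate) is unaffected by candidate
    prior bias, since every candidate is scored against the same query.
    """
    calibrated = cpn_score(log_p_cand_given_query, log_p_cand_prior, alpha)
    return beta * calibrated + (1.0 - beta) * log_p_query_given_cand


def rank(query, candidates, ll_cand, ll_prior, ll_query):
    """Rank candidates for one query, highest score first.

    ll_cand(q, c)  -> log P(c | q)  (MLLM conditional candidate score)
    ll_prior(c)    -> log P(c)      (MLLM score with a null/empty query)
    ll_query(q, c) -> log P(q | c)  (reverse-direction MLLM score)
    """
    return sorted(
        candidates,
        key=lambda c: bidirectional_score(
            ll_cand(query, c), ll_prior(c), ll_query(query, c)),
        reverse=True,
    )
```

With this form, setting `alpha = 0` recovers plain candidate-likelihood retrieval, so the contribution of the prior correction is easy to ablate.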