

Poster

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Vittorio Pipoli · Alessia Saporita · Federico Bolelli · Marcella Cornia · Lorenzo Baraldi · Costantino Grana · Rita Cucchiara · Elisa Ficarra


Abstract:

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for extending the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose MissRAG (Retrieval-Augmented Generation for missing modalities), a novel framework that mitigates this issue. It combines a novel multimodal RAG technique with a tailored prompt engineering strategy, enhancing model robustness against absent modalities while avoiding the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis. Our source code is available at https://anonymous.4open.science/r/MM_MLLM-1536
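The core retrieval idea described above, substituting a missing modality with content retrieved from a paired gallery using the modalities that are still available, can be sketched roughly as follows. This is an illustrative assumption, not the authors' actual implementation: the function names, the cosine-similarity scoring, and the gallery structure are all hypothetical.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_substitute(query_emb: np.ndarray,
                        gallery_embs: list[np.ndarray],
                        gallery_items: list[str],
                        k: int = 1) -> list[str]:
    """Hypothetical sketch of retrieval for a missing modality:
    score gallery entries by similarity to the embedding of an
    available modality, and return the top-k paired items (e.g.,
    audio or captions) to stand in for the absent input."""
    sims = np.array([cosine_sim(query_emb, g) for g in gallery_embs])
    top = np.argsort(-sims)[:k]
    return [gallery_items[i] for i in top]

# Toy usage: a 2-D embedding of the available (visual) modality
# retrieves the most similar gallery entry's paired content.
gallery_embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
gallery_items = ["audio_clip_dog_bark", "audio_clip_rainfall"]
query = np.array([0.9, 0.1])
print(retrieve_substitute(query, gallery_embs, gallery_items))
```

In a real MLLM pipeline the retrieved item would then be injected into the prompt (the paper's prompt engineering component) rather than returned as a string, and the embeddings would come from a pretrained multimodal encoder.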
