Poster
MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval
Jaeseok Byun · Young Kyun Jang · Seokhyeon Jeong · Donghyun Kim · Taesup Moon
Composed Image Retrieval (CIR) seeks to retrieve a target image given a reference image and conditioning text that specifies the desired modifications. While recent approaches have shown steady performance improvements on existing CIR benchmarks, we argue that it remains unclear whether these gains genuinely reflect an enhanced compositional understanding of both visual and textual information. For example, current benchmarks do not explicitly consider negation cases and offer limited semantic diversity, with insufficient hard negatives to thoroughly evaluate the CIR task. To bridge this gap, we introduce the Multimodal Arithmetic Benchmark for CIR (MA-CIR), a challenging CIR benchmark that integrates three arithmetic types (negation, replacement, and addition) across seven complex semantic categories (e.g., spatial reasoning, object reasoning). Moreover, carefully constructed hard negatives are incorporated to assess models in a controlled setting. On MA-CIR, we observe that current CIR models struggle with the negation and replacement arithmetic types and with semantic categories that require complex reasoning, indicating a potential over-reliance on object or entity information. To address this challenge, we propose leveraging strong text encoders, particularly those based on large language models (LLMs), together with carefully constructed text triplets that incorporate hard negatives to enhance compositional understanding. As a result, our approach achieves a 14% gain on MA-CIR while also improving R@1 on CIRR by 6%, all within a fast training time (under 2 hours on a single A100 GPU).