Poster Thu, Oct 23, 2025 • 2:15 PM – 4:15 PM PDT Exhibit Hall I #226

Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation

Wenliang Zhong · Rob Barton · Weizhi An · Feng Jiang · Hehuan Ma · Yuzhi Guo · Abhishek Dan · Shioulin Sam · Karim Bouyarmane · Junzhou Huang

[ Poster]

Abstract

Composed Image Retrieval (CIR) targets the retrieval of images conditioned on a reference image and a textual modification, but constructing labeled triplets (reference image, textual modification, target image) is inherently challenging. Existing Zero-Shot CIR (ZS-CIR) approaches often rely on well-aligned vision-language models (VLMs) to combine visual and textual inputs, or use large language models (LLMs) for richer modification understanding. While LLM-based methods excel in capturing textual details, they are computationally costly, slow to infer, and often restricted by proprietary constraints. In this paper, we argue that the superior performance of LLM-based ZS-CIR methods primarily stems from their capacity to follow instructions, an aspect largely missing in more efficient projection-based models built upon VLMs. To bridge this gap, we introduce DistillCIR, a dual-stream distillation framework that transfers LLMs’ instruction-following capability into compact, projection-based architectures. By synthesizing triplet data with an LLM and incorporating a novel reasoning process, DistillCIR learns both composed retrieval and instruction awareness. In addition, we train an open-source multimodal LLM on the generated data, and further distill its instruction-aware embeddings into the projection-based model. Without any reliance on LLMs at inference, DistillCIR significantly surpasses state-of-the-art ZS-CIR methods in both performance and efficiency, offering a promising direction for instruction-aware, lightweight CIR.

Chat is not available.