

Poster

Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval

Zhe Li · Lei Zhang · Zheren Fu · Kun Zhang · Zhendong Mao


Abstract:

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a text describing the user's intended modification, without training on triplet datasets. The key to this task is to apply the changes specified by the text to specific objects in the reference image. Previous works generate one or more pseudo words by projecting the reference image into the word embedding space. However, these methods ignore the fact that the objects edited in CIR are naturally hierarchical, and they lack text adaptation, so they fail to meet multi-level editing needs. In this paper, we argue that hierarchical object decomposition is the key to learning pseudo words, and propose a hierarchy-aware dynamic pseudo word learning (HIT) framework equipped with HIerarchy semantic parsing and Text-adaptive filtering. HIT offers two merits. First, it dynamically decomposes the image into editing objects of different granularity, guided by a set of learnable group tokens, thus naturally forming hierarchical semantic concepts. Second, a text-adaptive filtering strategy selects specific objects from the different levels based on the text, so as to learn hierarchical pseudo words that meet diverse editing needs. Extensive experiments on three challenging benchmarks show that HIT outperforms previous state-of-the-art methods by 5%-8% in average recall.
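The two ideas in the abstract, grouping image features with group tokens at several levels and then filtering the resulting groups by the text, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the soft-assignment pooling, the cosine-similarity top-k filter, and every name below (`group_patches`, `text_adaptive_filter`, the toy vectors) are assumptions made purely for illustration. In a real system the group tokens would be learned and the features would come from a vision-language encoder.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def group_patches(feats, group_tokens):
    """Softly assign each feature to the group tokens (hypothetical scheme)
    and return one weighted-average feature per group."""
    dim = len(feats[0])
    # weights[i][g]: how strongly feature i belongs to group g
    weights = [softmax([dot(f, g) for g in group_tokens]) for f in feats]
    grouped = []
    for gi in range(len(group_tokens)):
        total = sum(w[gi] for w in weights)  # always > 0 after softmax
        grouped.append([
            sum(w[gi] * f[d] for w, f in zip(weights, feats)) / total
            for d in range(dim)
        ])
    return grouped

def text_adaptive_filter(group_feats, text_emb, k):
    """Keep the indices of the k groups most similar to the text embedding
    (an assumed stand-in for text-adaptive filtering)."""
    ranked = sorted(range(len(group_feats)),
                    key=lambda i: cosine(group_feats[i], text_emb),
                    reverse=True)
    return ranked[:k]

# Toy data: four 2-D "patch" features and fixed (not learned) group tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fine_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coarse_tokens = [[1.0, 0.0], [0.0, 1.0]]

fine = group_patches(patches, fine_tokens)    # finer-grained groups
coarse = group_patches(fine, coarse_tokens)   # coarser groups built on top
candidates = fine + coarse                    # groups from both levels
kept = text_adaptive_filter(candidates, text_emb=[1.0, 0.0], k=2)
print(kept)
```

Stacking `group_patches` twice gives a two-level hierarchy, and the filter draws from both levels at once, so the text can pick a fine-grained or a coarse group depending on which matches better. The kept groups would then be projected to pseudo words, a step this sketch omits.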
