Generative Zero-Shot Composed Image Retrieval
Lan Wang1 Wei Ao1 Vishnu Boddeti1 Ser-Nam Lim2
1 Michigan State University   2 University of Central Florida
CVPR 2025
[Paper]
Zero-Shot Composed Image Retrieval vs. Pseudo Target-Aided Composed Image Retrieval. Conventional ZS-CIR methods
map the image latent embedding into the token embedding space via textual inversion. The proposed Pseudo Target-Aided method provides additional information for the composed embeddings through pseudo-target images.
Abstract
Composed Image Retrieval (CIR) is a vision-language task that uses queries comprising an image and a textual description to achieve precise image retrieval. The task seeks images that are visually similar to a reference image while incorporating specific changes or features described in text (the visual delta). CIR enables more flexible and user-specific retrieval by bridging visual data with verbal instructions. This paper introduces a novel generative method that augments Composed Image Retrieval with Composed Image Generation (CIG) to provide pseudo-target images. CIG utilizes a textual inversion network to map reference images into the semantic word space and combines them with the textual descriptions to generate pseudo-target images. These images serve as additional visual information and significantly improve the accuracy and relevance of the retrieved images when integrated into existing retrieval frameworks. Experiments across multiple CIR datasets and several baseline methods demonstrate improvements in retrieval performance, showing the potential of our approach as an effective add-on for existing composed image retrieval methods.
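The sketch below illustrates the textual inversion step described above: a small mapping network projects a frozen image embedding into the text encoder's token-embedding space, so it can be spliced into the delta caption as a pseudo word token. The network architecture, dimensions, and tensor shapes here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextualInversion(nn.Module):
    """Maps a frozen image embedding to a pseudo word-token embedding
    living in the text encoder's token-embedding space (assumed MLP design)."""
    def __init__(self, img_dim=768, tok_dim=768, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.GELU(),
            nn.Linear(hidden, tok_dim),
        )

    def forward(self, img_emb):
        return self.mlp(img_emb)

# Hypothetical usage with placeholder tensors: the pseudo token stands in for
# a special token [v] and is concatenated with the delta caption's token
# embeddings before the (frozen) text encoder produces the composed embedding.
phi = TextualInversion()
img_emb = torch.randn(1, 768)            # frozen image-encoder embedding of the reference image
pseudo_tok = phi(img_emb)                # pseudo word token [v]
caption_toks = torch.randn(1, 10, 768)   # token embeddings of the delta caption
composed_prompt = torch.cat([pseudo_tok.unsqueeze(1), caption_toks], dim=1)
```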
Method
Overview of the proposed Composed Image Generation (CIG). During training, the CIG model uses composed prompt embeddings as textual conditions and learns image information from them. At inference, the reference image and the delta caption form the composed prompt embedding, which the CIG model uses to generate pseudo-target images. These pseudo-target images assist in improving
ZS-CIR. Top: the training process, including textual inversion network pretraining (left) and CIG model pretraining (right); bottom: the inference process for CIR.
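As a rough illustration of how a generated pseudo-target image could assist an existing ZS-CIR pipeline at inference, the snippet below ranks a gallery by fusing the usual composed-query scores with scores from the pseudo-target image embedding. The linear fusion with weight `alpha` and the placeholder embeddings are assumptions for the sketch; they are not the paper's reported integration scheme or values.

```python
import torch
import torch.nn.functional as F

def fuse_scores(composed_emb, pseudo_target_emb, gallery_embs, alpha=0.8):
    """Rank the gallery using both the composed query embedding and the
    embedding of the generated pseudo-target image (assumed linear fusion)."""
    composed_emb = F.normalize(composed_emb, dim=-1)
    pseudo_target_emb = F.normalize(pseudo_target_emb, dim=-1)
    gallery_embs = F.normalize(gallery_embs, dim=-1)
    s_text = composed_emb @ gallery_embs.T        # baseline ZS-CIR similarity
    s_image = pseudo_target_emb @ gallery_embs.T  # similarity from the pseudo-target image
    return alpha * s_text + (1 - alpha) * s_image

# Placeholder embeddings; in practice these would come from a frozen image/text
# encoder applied to the composed query, the generated pseudo-target image,
# and the gallery images.
scores = fuse_scores(torch.randn(1, 768), torch.randn(1, 768), torch.randn(100, 768))
ranking = scores.argsort(dim=-1, descending=True)  # retrieved gallery indices
```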