Poster
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng · Caroline Chan · Fredo Durand · Phillip Isola
Current metrics for image-text alignment rely on human preferences or task-oriented VQA datasets for supervision. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image, we generate diverse captions using image-to-text models, then map these captions back to image space with a text-to-image model. We compute a cycle consistency score by measuring perceptual similarity between the original and reconstructed images. The score is used to determine preferences over captions: more descriptive and accurate captions yield more faithful reconstructions and are thus preferred over lower-quality captions. Analogously, we can measure cycle consistency in the text-to-image-to-text direction by measuring textual similarity between an input caption and its reconstruction through the cycle. We explore both mapping directions, resulting in 398K image-to-text and 468K text-to-image comparison pairs. Our reward model, trained on this dataset, outperforms state-of-the-art methods on detailed captioning tasks, with superior inference-time scalability when used as a verifier for Best-of-N evaluation. We will release our dataset, model, and code upon acceptance.
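The image-to-text-to-image direction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_captions` and `text_to_image` stand in for whichever off-the-shelf image-to-text and text-to-image models are used, and LPIPS is assumed here as one possible perceptual similarity metric.

```python
# Sketch of the image -> text -> image cycle consistency score.
# Assumptions: the caption and text-to-image models are supplied as callables,
# and LPIPS serves as the perceptual distance (the paper's exact choices may differ).
from typing import Callable, List, Tuple

import torch
import lpips  # pip install lpips


# LPIPS expects NCHW tensors scaled to [-1, 1].
perceptual = lpips.LPIPS(net="alex")


def cycle_consistency_scores(
    image: torch.Tensor,                                      # (1, 3, H, W) in [-1, 1]
    generate_captions: Callable[[torch.Tensor], List[str]],   # image -> diverse candidate captions
    text_to_image: Callable[[str], torch.Tensor],             # caption -> reconstructed (1, 3, H, W) image
) -> List[Tuple[str, float]]:
    """Score each candidate caption by how faithfully it reconstructs the original image."""
    captions = generate_captions(image)
    scored = []
    with torch.no_grad():
        for caption in captions:
            reconstruction = text_to_image(caption)
            # Lower perceptual distance = more faithful reconstruction = better caption,
            # so negate the distance to get a "higher is better" cycle consistency score.
            distance = perceptual(image, reconstruction).item()
            scored.append((caption, -distance))
    # The resulting ranking over captions induces the preference pairs
    # used to train the reward model.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The text-to-image-to-text direction follows the same pattern, with a textual similarity measure between the input caption and its reconstruction replacing the perceptual distance.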