Poster Exhibit Hall I #448

DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes

Zonglin Di ⋅ Jing Shi ⋅ Yifei Fan ⋅ Hao Tan ⋅ Alexander Black ⋅ John Collomosse ⋅ Yang Liu

2025 Poster

Abstract

The image difference captioning (IDC) task is to describe the distinctions between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce DiffTell, a high-quality dataset covering various types of image manipulations, including global image alterations, object-level changes, and text manipulations. Data quality is controlled by careful human filtering. Additionally, to scale up data collection without prohibitive human labor costs, we explore the possibility of automatic filtering for quality control. We demonstrate that both traditional methods and recent multimodal large language models (MLLMs) improve on the IDC task after training on the DiffTell dataset. Through extensive ablation studies, we provide a detailed analysis of the performance gains attributed to DiffTell. DiffTell significantly expands the resources available for IDC research, offering a more comprehensive foundation and benchmark for future investigations.
