

Poster

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Lin Zhang · Xianfang Zeng · Kangcong Li · Gang YU · Tao Chen


Abstract:

We propose SC-Captioner, a reinforcement learning framework that gives image captioning models the ability to self-correct. Our key technique is the design of a reward function that incentivizes accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between the sets of the original and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate recall bonuses for accurate corrections and hallucination punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Experiments show that applying SC-Captioner to large visual-language models yields better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
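
The set-difference reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `correction_reward`, the `bonus` and `penalty` weights, the use of exact string matching, and the unnormalized sum over element types are all assumptions made for illustration. The abstract does not specify how elements are matched or how the terms are weighted.

```python
from typing import Dict, Set

# Element types named in the abstract: objects, attributes, relations.
ElementSets = Dict[str, Set[str]]


def correction_reward(
    original: ElementSets,
    corrected: ElementSets,
    reference: ElementSets,
    bonus: float = 1.0,    # hypothetical weight for accurate corrections
    penalty: float = 1.0,  # hypothetical weight for wrong corrections
) -> float:
    """Score a self-correction by comparing added/removed elements to the reference.

    Assumes exact string matching between elements, which is a simplification.
    """
    reward = 0.0
    for kind, ref in reference.items():
        orig = original.get(kind, set())
        corr = corrected.get(kind, set())

        added = corr - orig    # elements introduced by the correction
        removed = orig - corr  # elements dropped by the correction

        good_additions = added & ref    # newly recalled reference elements
        bad_additions = added - ref     # newly hallucinated elements
        good_removals = removed - ref   # hallucinations that were deleted
        bad_removals = removed & ref    # correct elements that were deleted

        reward += bonus * (len(good_additions) + len(good_removals))
        reward -= penalty * (len(bad_additions) + len(bad_removals))
    return reward


# Toy example with object sets only (all values are made up).
original = {"object": {"dog", "frisbee", "cat"}}
corrected = {"object": {"dog", "frisbee", "grass", "car"}}
reference = {"object": {"dog", "frisbee", "grass"}}

example_reward = correction_reward(original, corrected, reference)
```

In the toy example, the correction adds "grass" (in the reference) and "car" (not in the reference) and removes "cat" (not in the reference), so two accurate corrections are offset by one wrong addition. Elements left unchanged contribute nothing, which reflects the abstract's focus on rewarding the correction step rather than the caption as a whole.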
