

Poster

Automated Red Teaming for Text-to-Image Models through Feedback-Guided Prompt Iteration with Vision-Language Models

Wei Xu · Kangjie Chen · Jiawei Qiu · Yuyang Zhang · Run Wang · Jin Mao · Tianwei Zhang · Lina Wang


Abstract:

Text-to-image models have achieved remarkable progress in generating high-quality images from textual prompts, yet their potential for misuse, such as generating unsafe content, remains a critical concern. Existing safety mechanisms, such as filtering and fine-tuning, remain insufficient in preventing vulnerabilities exposed by adversarial prompts. To systematically evaluate these weaknesses, we propose an automated red-teaming framework, Feedback-Guided Prompt Iteration (FGPI), which employs a Vision-Language Model (VLM) as the red-teaming agent and follows a feedback-guide-rewrite paradigm for iterative prompt optimization. The red-teaming VLM analyzes prompt-image pairs based on evaluation results, provides feedback and modification strategies to enhance adversarial effectiveness while preserving safety constraints, and iteratively improves prompts. To enable this functionality, we construct a multi-turn conversational VQA dataset with over 6,000 instances, covering seven attack types and facilitating the fine-tuning of the red-teaming VLM. Extensive experiments demonstrate the effectiveness of our approach, achieving over a 90% attack success rate within five iterations while maintaining prompt stealthiness and safety. The experiments also validate the adaptability, diversity, transferability, and explainability of FGPI. The source code and dataset are available at (URL omitted for double-blind reviewing; code available in supplementary materials).
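
The following is a minimal sketch of the feedback-guide-rewrite loop described in the abstract. All interfaces and names (fgpi_red_team, t2i_model, red_team_vlm, safety_evaluator, and their methods) are hypothetical placeholders for illustration, not the authors' actual API or implementation.

def fgpi_red_team(seed_prompt, t2i_model, red_team_vlm, safety_evaluator,
                  max_iterations=5):
    """Iteratively rewrite a prompt until the generated image triggers the
    target unsafe behavior (attack succeeds) or the iteration budget runs out."""
    prompt = seed_prompt
    for _ in range(max_iterations):
        image = t2i_model.generate(prompt)                # text-to-image generation
        result = safety_evaluator.evaluate(prompt, image)  # judge the prompt-image pair
        if result.attack_succeeded:
            return prompt, image, result                  # adversarial prompt found
        # The red-teaming VLM inspects the prompt-image pair together with the
        # evaluation result, produces feedback and a modification strategy, and
        # rewrites the prompt to be more adversarial while remaining stealthy.
        feedback = red_team_vlm.analyze(prompt, image, result)
        prompt = red_team_vlm.rewrite(prompt, feedback)
    return prompt, None, None                             # budget exhausted, no success

In this sketch the loop terminates within the five-iteration budget reported in the abstract; the analyze/rewrite steps stand in for the fine-tuned red-teaming VLM's feedback-guide-rewrite behavior.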
