Poster
JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models
Xiaolong Jin · Zixuan Weng · Hanxi Guo · Chenlong Yin · Siyuan Cheng · Guangyu Shen · Xiangyu Zhang
Diffusion models are widely used in real-world applications, but ensuring their safety remains a major challenge. Despite many efforts to strengthen the security of diffusion models, jailbreak and adversarial attacks can still bypass these defenses and produce harmful content. However, the lack of standardized evaluation makes it difficult to assess the robustness of diffusion model systems. To address this, we introduce JailbreakDiffBench, a comprehensive benchmark for systematically evaluating the safety of diffusion models against various attacks and under different defenses. Our benchmark includes a high-quality, human-annotated dataset of prompts and images covering diverse attack scenarios. The benchmark consists of two key components: (1) an evaluation protocol that measures the effectiveness of moderation mechanisms and (2) an attack assessment module that benchmarks adversarial jailbreak strategies. Through extensive experiments, we analyze existing filters and reveal critical weaknesses in current safety measures. JailbreakDiffBench is designed to support both text-to-image and text-to-video models, ensuring extensibility and reproducibility. The code is available at https://anonymous.4open.science/r/jailbreakdiffbench/
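The abstract names two components: one that scores moderation mechanisms against human annotations and one that scores jailbreak attacks. The following is a minimal sketch of how such scoring is commonly done. It assumes binary human labels (harmful or not) and binary filter decisions (blocked or not), and it uses precision, recall, F1, and an attack success rate as illustrative metrics. These metrics and the names `Sample`, `filter_metrics`, and `attack_success_rate` are hypothetical and are not taken from the JailbreakDiffBench code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record: human annotation and moderation-filter decision.
    harmful: bool   # True if annotators judged the prompt/image harmful
    blocked: bool   # True if the filter blocked it


def filter_metrics(samples):
    """Score a moderation filter against human labels (harmful = positive class)."""
    tp = sum(s.harmful and s.blocked for s in samples)        # harmful and caught
    fp = sum((not s.harmful) and s.blocked for s in samples)  # benign but blocked
    fn = sum(s.harmful and not s.blocked for s in samples)    # harmful but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def attack_success_rate(outcomes):
    """Fraction of adversarial prompts whose harmful output was not blocked.

    Each outcome is True when the attack produced harmful content that
    passed the defense (a hypothetical success criterion).
    """
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    data = [Sample(True, True), Sample(True, False),
            Sample(False, True), Sample(False, False)]
    print(filter_metrics(data))                              # precision, recall, f1 all 0.5
    print(attack_success_rate([True, False, False, True]))   # 0.5
```

In this framing, recall reflects how many harmful items a filter catches, precision reflects how often it over-blocks benign content, and the attack success rate summarizes how often a jailbreak strategy gets past a given defense.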