Poster
Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
Ibtihel Amara · Ahmed Imtiaz Humayun · Ivana Kajic · Zarana Parekh · Natalie Harris · Sarah Young · Chirag Nagpal · Najoung Kim · Junfeng He · Cristina Vasconcelos · Deepak Ramachandran · Golnoosh Farnadi · Katherine Heller · Mohammad Havaei · Negar Rostamzadeh
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships, including visually similar, binomial, and semantically related concepts. To enable a more comprehensive evaluation of concept erasure, we introduce EraseBench, a multidimensional framework designed to rigorously assess text-to-image models post-erasure. It encompasses over 100 diverse concepts, carefully curated seeded prompts to ensure reproducible image generation, and dedicated evaluation prompts for model-based assessment. Paired with a robust suite of evaluation metrics, our framework provides a holistic and in-depth analysis of concept erasure’s effectiveness and its long-term impact on model behaviour. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.