

Poster

Cultural Gaps in the Long Tail of Text-to-Image Models

Aniket Rege · Zinnia Nie · Unmesh Raskar · Mahesh Ramesh · Zhuoran Yu · Aditya Kusupati · Yong Jae Lee · Ramya Vinayak


Abstract:

Popular text-to-image (T2I) models are trained on web-scraped data that is heavily Amero-centric and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to text-to-image systems as a proxy for human judgments. The CuRe dataset has a novel categorical hierarchy that enables benchmarking T2I systems in this manner, with 32 cultural subcategories across six broad cultural axes (food, art, fashion, architecture, celebrations, and people), built from the crowdsourced Wikimedia knowledge graph. Unlike existing benchmarks, which suffer from "generative entanglement" due to overlapping training and evaluation data, CuRe enables fine-grained cultural comparisons. We empirically observe much stronger correlations between our class of scorers and human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMv2, and DINOv2), image-text models (CLIP, SigLIP), and state-of-the-art text-to-image systems including Stable Diffusion 3.5 Large and FLUX.1. Code and benchmark dataset are available at: hidden for double-blind review.
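
As a rough illustration of the scoring idea, the sketch below estimates the marginal utility of attribute specification using CLIP as the image-text scorer: if an image generated from the bare artifact name aligns with its description about as well as one generated from a fully attributed prompt, the model likely already represents that artifact; a large gain from adding attributes suggests the artifact sits in the long tail. This is a minimal sketch, not the paper's exact implementation; the prompt template, the `t2i_generate` callable, and the example artifact are assumptions for illustration.

```python
# Minimal sketch: score the marginal utility of cultural attribute
# specification with an off-the-shelf image-text model (CLIP here).
# Illustrative only; the actual CuRe scorers, prompt hierarchy, and
# encoders (SigLIP 2, AIMv2, DINOv2) are defined in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def marginal_utility(t2i_generate, artifact: str,
                     attributes: str, reference_text: str) -> float:
    """Gain in image-text alignment when the prompt spells out cultural
    attributes. `t2i_generate` is any prompt -> PIL.Image callable
    (e.g., a wrapper around a Stable Diffusion pipeline)."""
    img_bare = t2i_generate(f"a photo of {artifact}")
    img_full = t2i_generate(f"a photo of {artifact}, {attributes}")
    return alignment(img_full, reference_text) - alignment(img_bare, reference_text)

# Hypothetical usage (names and attribute text are illustrative):
# gain = marginal_utility(
#     my_t2i,
#     artifact="hyderabadi biryani",
#     attributes="a South Indian rice dish with saffron and fried onions",
#     reference_text="a photo of hyderabadi biryani",
# )
```

In this sketch a high `gain` means the T2I system needed explicit attributes to depict the artifact faithfully, which serves as the proxy for weak cultural representation that the abstract describes.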
