Poster

VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev · Thaddäus Wiedemer · Ameya Prabhu · Matthias Bethge · Wieland Brendel · A. Koepke


Abstract:

Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.
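The modality annotations are what distinguish VGGSounder from a standard multi-label test set: each ground-truth label is tagged with whether it is perceivable in the audio stream, the visual stream, or both, so a model's score can be decomposed per modality. A minimal sketch of such a decomposition is shown below; the record layout, field names, and modality tags are assumptions for illustration, not the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical record format (illustrative, not the actual VGGSounder schema):
# each test clip carries multiple ground-truth labels, each tagged with the
# modality in which it is perceivable.
annotations = [
    {"clip": "clip_0001",
     "labels": [("dog barking", "audio-visual"), ("wind noise", "audio")]},
    {"clip": "clip_0002",
     "labels": [("playing violin", "visual"), ("violin", "audio-visual")]},
]

# Hypothetical model output: a set of predicted labels per clip.
predictions = {
    "clip_0001": {"dog barking"},
    "clip_0002": {"playing violin"},
}

def recall_by_modality(annotations, predictions):
    """Fraction of ground-truth labels recovered, split by modality tag.

    Because predictions here are plain label sets without modality tags,
    recall (rather than precision) is the metric that decomposes cleanly
    per modality.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for record in annotations:
        pred = predictions.get(record["clip"], set())
        for label, modality in record["labels"]:
            totals[modality] += 1
            hits[modality] += label in pred
    return {m: hits[m] / totals[m] for m in totals}

print(recall_by_modality(annotations, predictions))
# -> {'audio-visual': 0.5, 'audio': 0.0, 'visual': 1.0}
```

A gap between the audio-tagged and visual-tagged scores of this kind is exactly the sort of modality-specific limitation the re-annotation is designed to surface.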
