

Poster

CogCM: Cognition-Inspired Contextual Modeling for Audio-Visual Speech Enhancement

Feixiang Wang · Shuang Yang · Shiguang Shan · Xilin Chen


Abstract:

Audio-Visual Speech Enhancement (AVSE) leverages both audio and visual information to improve speech quality. Even under noisy real-world conditions, humans are generally able to perceive and interpret corrupted speech segments as clear, and research in cognitive science has shown how the brain merges auditory and visual inputs to achieve this. These studies uncover four key insights for AVSE, reflecting a hierarchical synergy of semantic and signal processes in which visual cues enrich both levels: (1) humans utilize high-level semantic context to reconstruct corrupted speech signals; (2) visual cues correlate strongly with semantic information, enabling them to facilitate semantic context modeling; (3) visual appearance and vocal information jointly benefit identification, implying that visual cues strengthen low-level signal context modeling; (4) high-level semantic knowledge and low-level auditory processing operate concurrently, allowing semantics to guide signal-level context modeling. Motivated by these insights, we propose CogCM, a cognition-inspired hierarchical contextual modeling framework. CogCM comprises three core modules: (1) a semantic context modeling module (SeCM) that captures high-level semantic context from both audio and visual modalities; (2) a signal context modeling module (SiCM) that models fine-grained temporal-spectral structures under multi-modal semantic context guidance; (3) a semantic-to-signal guidance module (SSGM) that leverages semantic context to guide signal context modeling across both temporal and frequency dimensions. Extensive experiments on 7 benchmarks demonstrate CogCM's superiority; it achieves 63.6% SDR and 58.1% PESQ improvements at -15 dB SNR, outperforming state-of-the-art methods across all metrics.
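The abstract does not include implementation details, so the following is only a minimal, hypothetical PyTorch-style sketch of how the three described modules (SeCM, SiCM, SSGM) could be composed into a mask-based enhancement pipeline. All layer choices, tensor shapes, gating mechanisms, and class internals are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the CogCM composition described in the abstract.
# Every architectural choice below (transformer fusion, FiLM-style gating,
# BiLSTM signal modeling, mask output) is an illustrative assumption.
import torch
import torch.nn as nn


class SeCM(nn.Module):
    """Semantic context modeling: fuse audio and visual features into a
    high-level semantic context sequence (assumed transformer encoder)."""
    def __init__(self, dim=256, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat, visual_feat: (batch, time, dim)
        fused = self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))
        return self.encoder(fused)  # semantic context: (batch, time, dim)


class SSGM(nn.Module):
    """Semantic-to-signal guidance: modulate signal features with semantic
    context along the temporal and frequency axes (assumed sigmoid gates)."""
    def __init__(self, dim=256, n_freq=257):
        super().__init__()
        self.temporal_gate = nn.Linear(dim, dim)   # per-frame guidance
        self.freq_gate = nn.Linear(dim, n_freq)    # per-frequency guidance

    def forward(self, signal_feat, spec, semantic_ctx):
        # signal_feat: (batch, time, dim), spec: (batch, time, n_freq)
        t_gate = torch.sigmoid(self.temporal_gate(semantic_ctx))
        f_gate = torch.sigmoid(self.freq_gate(semantic_ctx))
        return signal_feat * t_gate, spec * f_gate


class SiCM(nn.Module):
    """Signal context modeling: capture fine-grained temporal-spectral
    structure under semantic guidance (assumed bidirectional LSTM)."""
    def __init__(self, dim=256, n_freq=257):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim // 2, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(dim, n_freq)

    def forward(self, guided_feat):
        ctx, _ = self.rnn(guided_feat)
        return torch.sigmoid(self.mask_head(ctx))  # time-frequency mask


class CogCM(nn.Module):
    """End-to-end composition: SeCM -> SSGM -> SiCM, predicting a mask
    applied to the (guided) noisy magnitude spectrogram."""
    def __init__(self, dim=256, n_freq=257):
        super().__init__()
        self.secm = SeCM(dim)
        self.ssgm = SSGM(dim, n_freq)
        self.spec_proj = nn.Linear(n_freq, dim)
        self.sicm = SiCM(dim, n_freq)

    def forward(self, noisy_spec, audio_feat, visual_feat):
        # noisy_spec: (batch, time, n_freq) magnitude spectrogram
        semantic_ctx = self.secm(audio_feat, visual_feat)
        signal_feat = self.spec_proj(noisy_spec)
        guided_feat, guided_spec = self.ssgm(signal_feat, noisy_spec, semantic_ctx)
        mask = self.sicm(guided_feat)
        return mask * guided_spec  # enhanced magnitude spectrogram
```

In this sketch, a forward pass would take a noisy magnitude spectrogram together with pre-extracted, temporally aligned audio and visual (e.g. lip-region) features of the same sequence length; the semantic context produced by SeCM gates the signal pathway in both time and frequency before SiCM predicts the enhancement mask.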
