Poster
Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information
Zhaoxin Yuan · Shuang Yang · Shiguang Shan · Xilin Chen
Visual Speech Recognition (VSR) aims to infer spoken content by analyzing the speaker’s facial dynamics. While this technology has shown promise, a question naturally arises: is it sufficient to rely solely on such visual information in complex real-world scenarios? Humans, by contrast, excel at lip-reading by leveraging information beyond lip movements, such as speech-related background and prior knowledge about the task. Despite this well-recognized human capability, existing approaches have not explored incorporating such Peripheral Information into automatic frameworks. We categorize peripheral information into a hierarchical structure based on its relevance to the spoken content: (1) Content Anchors (e.g., the speech topic or a description of it), (2) Task Expertise (task-related background, e.g., human prior lip-reading experience), and (3) Linguistic Perturbation (irrelevant information that VSR systems should process alongside meaningful signals). To unlock the valuable clues embedded in peripheral information, we propose a novel multi-modal framework that uses a large language model (LLM) to decode spoken content while seamlessly integrating peripheral information. Central to our framework is a new adaptation method, Synergy LoRA, which enables coordinated adaptation of visual and textual inputs: visual features are processed by an independent module while being guided by semantic cues from peripheral information through a mixture-of-experts (MoE) textual adaptation module. This design preserves the fine-grained spatiotemporal details of the visual modality while incorporating peripheral information to enhance recognition. On the widely used LRS3 dataset, with readily available peripheral information, our model achieves a Word Error Rate (WER) of 22.0%, surpassing recent approaches. Further experiments on the challenging AVSpeech dataset also show promising results in handling complex real-world scenarios.
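To make the Synergy LoRA idea concrete, below is a minimal, illustrative PyTorch sketch of the general pattern described in the abstract: a dedicated low-rank adapter on the visual stream, plus a small mixture of LoRA experts on the textual (peripheral-information) stream whose router is conditioned on that text. All class and parameter names (LoRALayer, SynergyLoRASketch, num_experts, rank, etc.) are hypothetical and chosen for illustration; this is not the authors' implementation, which may differ in where adapters are inserted and how the two streams are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer(nn.Module):
    """Standard low-rank adapter added on top of a frozen projection."""
    def __init__(self, dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale


class SynergyLoRASketch(nn.Module):
    """Sketch of coordinated adaptation: an independent visual LoRA branch,
    and an MoE of LoRA experts for peripheral-text features, gated by the
    text itself so semantic cues steer the adaptation."""
    def __init__(self, dim: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.visual_lora = LoRALayer(dim, rank)
        self.text_experts = nn.ModuleList(
            [LoRALayer(dim, rank) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, visual_feats, text_feats):
        # Visual path: independent adapter preserves spatiotemporal detail.
        v = visual_feats + self.visual_lora(visual_feats)

        # Textual path: route each peripheral-information token to experts.
        gates = F.softmax(self.router(text_feats), dim=-1)       # (B, T_txt, E)
        expert_outs = torch.stack(
            [e(text_feats) for e in self.text_experts], dim=-1
        )                                                         # (B, T_txt, D, E)
        t = text_feats + (expert_outs * gates.unsqueeze(-2)).sum(-1)

        # Concatenate adapted streams for a downstream (frozen) LLM decoder.
        return torch.cat([v, t], dim=1)


if __name__ == "__main__":
    model = SynergyLoRASketch(dim=512)
    vis = torch.randn(2, 75, 512)   # e.g., 75 frames of lip-region features
    txt = torch.randn(2, 32, 512)   # e.g., 32 tokens of peripheral information
    print(model(vis, txt).shape)    # torch.Size([2, 107, 512])
```

The sketch only shows the adapter-level routing; in the actual framework the adapted features would feed an LLM that decodes the spoken content.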