Poster

GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars

Shivangi Aneja · Artem Sevastopolsky · Tobias Kirschstein · Justus Thies · Angela Dai · Matthias Nießner


Abstract:

We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photorealistic, personalized, and multi-view consistent 3D human head avatars from spoken audio at real-time rendering rates. To capture the expressive and detailed nature of human heads, including skin furrowing and fine facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create photorealistic and temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize dynamic facial details at real-time rendering rates. Next, we devise an audio-conditioned transformer model that extracts lip and wrinkle features from the audio input and combines them with our 3D avatar through joint 3D sequence refinement to synthesize photorealistic animations. To the best of our knowledge, this is the first work to generate photorealistic multi-view 3D head avatar sequences from spoken audio alone, representing a significant advancement in audio-driven 3D facial animation. In the absence of a high-quality multi-view talking-face dataset, we captured a new large-scale multi-view dataset of audio-visual sequences featuring native English speakers with diverse facial geometry. GaussianSpeech achieves state-of-the-art quality consistent with each avatar's speaking style.
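To make the pipeline shape concrete, below is a minimal, hypothetical PyTorch sketch of an audio-conditioned transformer that maps per-frame audio embeddings to per-Gaussian position offsets and expression-dependent colors for a static 3DGS head rig. All module names, dimensions, and the choice of a standard TransformerEncoder are illustrative assumptions and not the authors' implementation (the paper's model additionally performs joint 3D sequence refinement, omitted here).

    # Hypothetical sketch only; not the GaussianSpeech reference code.
    import torch
    import torch.nn as nn

    class AudioConditionedAnimator(nn.Module):
        def __init__(self, audio_dim=768, model_dim=256, num_gaussians=10_000):
            super().__init__()
            # Project pretrained audio features (e.g. wav2vec-style
            # embeddings) into the transformer's working dimension.
            self.audio_proj = nn.Linear(audio_dim, model_dim)
            layer = nn.TransformerEncoderLayer(
                d_model=model_dim, nhead=4, batch_first=True)
            self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=4)
            # Per-frame heads: 3D position offsets and expression-dependent
            # RGB colors for every Gaussian in the avatar.
            self.offset_head = nn.Linear(model_dim, num_gaussians * 3)
            self.color_head = nn.Linear(model_dim, num_gaussians * 3)
            self.num_gaussians = num_gaussians

        def forward(self, audio_feats):
            # audio_feats: (batch, frames, audio_dim) windowed audio embeddings.
            x = self.temporal_encoder(self.audio_proj(audio_feats))
            b, t, _ = x.shape
            offsets = self.offset_head(x).view(b, t, self.num_gaussians, 3)
            colors = torch.sigmoid(self.color_head(x))
            colors = colors.view(b, t, self.num_gaussians, 3)
            # Each frame's offsets deform the canonical Gaussians, while the
            # predicted colors capture expression-dependent appearance
            # (e.g. wrinkle shading) before splatting and rendering.
            return offsets, colors

The design choice of predicting colors alongside geometry reflects the abstract's point that wrinkles and fine skin detail are largely appearance changes, which pure geometric deformation of Gaussians would miss.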
