Poster

What we need is explicit controllability: Training 3D gaze estimator using only facial images

Tingwei Li · Jun Bao · Zhenzhong Kuang · Buyu Liu


Abstract:

This work focuses on unsupervised 3D gaze estimation. Specifically, we adopt a learning-by-synthesis approach in which a gaze prediction model is trained on simulated data. Unlike existing methods, which lack explicit and accurate control over facial images, particularly the eye regions, we propose a geometrically meaningful 3D representation that enables diverse, precise, and explicit control over illumination, eye regions, and gaze targets using only facial images. Given a sequence of facial images, our method constructs a mesh representation in which each mesh is associated with 3D Gaussians, enabling explicit lighting control. To further enhance realism, we introduce eye-focused constraints, including a rotation symmetry protocol as well as geometry and appearance losses for the eye regions, alongside conventional learning objectives. Additionally, we incorporate a virtual screen target and rotate the eyeballs accordingly, generating more accurate pseudo gaze directions paired with realistic facial images. We validate our approach through extensive experiments on three benchmarks. The results demonstrate that gaze estimators trained with our method outperform all unsupervised baselines and achieve performance comparable to cross-dataset approaches. Furthermore, our method generates the most visually realistic images, as confirmed by both objective and subjective image quality metrics.
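As a concrete illustration of the virtual-screen step, the minimal Python sketch below computes a pseudo gaze direction from an eyeball center to a sampled screen target and the rotation that would reorient the eyeball accordingly. All names, coordinate conventions, and values are hypothetical assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pseudo_gaze_direction(eyeball_center, screen_target):
    """Unit gaze vector from the eyeball center to a virtual screen point."""
    d = screen_target - eyeball_center
    return d / np.linalg.norm(d)

def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues' formula)."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Antipodal case: rotate 180 degrees about any axis orthogonal to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # R = I + K + K^2 / (1 + c) for unit vectors a and b.
    return np.eye(3) + K + K @ K / (1.0 + c)

# Hypothetical example: reorient an eyeball toward a sampled screen target.
eye_center = np.array([0.03, 0.0, 0.0])        # eyeball center in head coordinates (meters)
current_axis = np.array([0.0, 0.0, -1.0])      # current optical axis (looking along -z)
target = np.array([0.1, -0.05, -0.5])          # sampled point on the virtual screen

gaze = pseudo_gaze_direction(eye_center, target)  # pseudo gaze label for training
R = rotation_between(current_axis, gaze)          # rotate eyeball geometry about eye_center by R
```

In this reading, the rendered face with the rotated eyeball and the vector `gaze` form one (image, pseudo-label) training pair; sampling many screen targets yields the diverse supervision described above.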
