

Poster

EgoM2P: Egocentric Multimodal Multitask Pretraining

Gen Li · Yutong Chen · Yiqian Wu · Kaifeng Zhao · Marc Pollefeys · Siyu Tang


Abstract:

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer’s actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose video model for egocentric understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. To support the community and advance egocentric vision research, we will fully open-source EgoM2P, along with the training and evaluation code.
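The abstract describes masked modeling over temporally-aware multimodal tokens. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' implementation: it assumes each modality (RGB, depth, gaze, camera pose) has already been discretized by a separately trained temporal tokenizer, randomly masks a fraction of tokens per modality, and reconstructs the masked tokens with per-modality heads. All module names, vocabulary sizes, and hyperparameters here are made up for illustration.

```python
# Hypothetical sketch of masked multimodal token modeling (illustrative only).
import torch
import torch.nn as nn

class MaskedMultimodalModel(nn.Module):
    def __init__(self, vocab_sizes, dim=256, depth=4, heads=8, max_len=64):
        super().__init__()
        self.modalities = list(vocab_sizes)
        # One embedding table per modality; the last index serves as a [MASK] slot.
        self.embed = nn.ModuleDict({m: nn.Embedding(v + 1, dim) for m, v in vocab_sizes.items()})
        self.mask_id = {m: v for m, v in vocab_sizes.items()}
        # Learned modality and temporal position embeddings.
        self.mod_embed = nn.ParameterDict({m: nn.Parameter(torch.zeros(1, 1, dim)) for m in vocab_sizes})
        self.time_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # One classification head per modality to reconstruct its masked tokens.
        self.heads = nn.ModuleDict({m: nn.Linear(dim, v) for m, v in vocab_sizes.items()})

    def forward(self, tokens, mask_ratio=0.5):
        # tokens: dict of modality -> LongTensor [batch, time] of tokenizer indices.
        feats, targets, masks = [], {}, {}
        for m in self.modalities:
            t = tokens[m]
            mask = torch.rand_like(t, dtype=torch.float) < mask_ratio
            masked = t.masked_fill(mask, self.mask_id[m])
            x = self.embed[m](masked) + self.mod_embed[m] + self.time_embed[:, : t.size(1)]
            feats.append(x)
            targets[m], masks[m] = t, mask
        # Fuse all modalities into one token sequence and encode jointly.
        h = self.encoder(torch.cat(feats, dim=1))
        losses, offset = {}, 0
        for m in self.modalities:
            T = tokens[m].size(1)
            logits = self.heads[m](h[:, offset : offset + T])
            offset += T
            if masks[m].any():
                losses[m] = nn.functional.cross_entropy(logits[masks[m]], targets[m][masks[m]])
        return losses

# Toy usage: 8-frame clips; per-modality vocabulary sizes are arbitrary.
vocab = {"rgb": 1024, "depth": 512, "gaze": 128, "camera": 128}
model = MaskedMultimodalModel(vocab)
batch = {m: torch.randint(0, v, (2, 8)) for m, v in vocab.items()}
print({m: float(l) for m, l in model(batch).items()})
```

At inference time, a model trained this way can be repurposed for different tasks by masking only the target modality (e.g. all gaze or depth tokens) and decoding its predictions, which is consistent with the multitask use described in the abstract.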
