Poster
Head2Body: Body Pose Generation from Multi-Sensory Head-Mounted Inputs
Minh Tran · Hongda Mao · Qingshuang Chen · Yelin Kim
Generating body pose from head-mounted, egocentric inputs is essential for immersive VR/AR and assistive technologies, as it supports more natural interactions. However, the task is challenging due to the limited visibility of body parts in first-person views and the sparseness of sensory data, with only a single device placed on the head. To address these challenges, we introduce Head2Body, a novel framework for body pose estimation that effectively combines IMU and visual data. First, we present a pre-trained IMU encoder, trained on over 1,700 hours of head-IMU data from wearable eyeglasses, to better capture detailed temporal motion cues given limited labeled egocentric pose data. For visual processing, we leverage large vision-language models (LVLMs) to segment body parts that appear sporadically in video frames, improving visual feature extraction. To better guide pose generation from the sparse signals of a single head-mounted device, we incorporate a Vector Quantized Variational Autoencoder (VQ-VAE) to represent poses as discrete tokens, which capture high-frequency motion patterns and provide a more structured representation of body pose. Our experiments demonstrate the effectiveness of the proposed approach, yielding 8–13% gains over state-of-the-art baselines on four datasets: AMASS, KinPoly, GIMO, and EgoExo4D. By capturing subtle temporal dynamics and leveraging complementary sensory data, our approach advances accurate egocentric body pose estimation and sets a new benchmark for multi-modal, first-person motion tracking.
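To make the VQ-VAE component concrete, the minimal PyTorch sketch below shows how body poses can be quantized into discrete tokens with a learned codebook and reconstructed by a decoder. This is not the authors' implementation: the pose dimensionality, layer sizes, codebook size, and loss weighting are illustrative assumptions; the abstract only states that poses are represented as discrete tokens via a VQ-VAE.

```python
# Minimal VQ-VAE pose tokenizer sketch (assumed architecture, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseVQVAE(nn.Module):
    def __init__(self, pose_dim=66, latent_dim=128, num_codes=512):
        super().__init__()
        # Encoder: per-frame pose vector -> continuous latent.
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Discrete codebook of pose tokens.
        self.codebook = nn.Embedding(num_codes, latent_dim)
        # Decoder: quantized latent -> reconstructed pose.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),
        )

    def quantize(self, z):
        # Nearest-neighbour lookup in the codebook (Euclidean distance).
        dists = torch.cdist(z, self.codebook.weight)   # (B, num_codes)
        tokens = dists.argmin(dim=-1)                  # discrete pose tokens
        z_q = self.codebook(tokens)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, tokens

    def forward(self, pose):
        z = self.encoder(pose)
        z_q, tokens = self.quantize(z)
        recon = self.decoder(z_q)
        # Standard VQ-VAE objective: reconstruction + codebook + commitment terms.
        recon_loss = F.mse_loss(recon, pose)
        codebook_loss = F.mse_loss(self.codebook(tokens), z.detach())
        commit_loss = F.mse_loss(z, self.codebook(tokens).detach())
        return recon, tokens, recon_loss + codebook_loss + 0.25 * commit_loss


if __name__ == "__main__":
    model = PoseVQVAE()
    dummy_pose = torch.randn(8, 66)  # e.g. 22 joints x 3 rotation parameters (assumed)
    recon, tokens, loss = model(dummy_pose)
    print(recon.shape, tokens.shape, loss.item())
```

In a setup like this, a downstream generator conditioned on the head-mounted IMU and visual features would predict the discrete token sequence, and the decoder would map those tokens back to full-body poses.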