MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Abstract
We interact with objects every day, making the holistic 3D reconstruction of hands and objects from video essential for applications such as robotic in-hand manipulation. While most RGB-based methods rely on object templates, existing template-free approaches depend heavily on image observations and assume the object is fully visible in the video. This assumption often fails in real-world scenarios, where the camera is fixed and the object is held in a static grip: parts of the object remain unobserved, and the under-observed regions are reconstructed unrealistically. To address this, we introduce MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited views. Our key insight is that, although paired 3D hand-object data is extremely scarce, large-scale pretrained models, such as image-to-3D diffusion models, offer abundant object supervision. This supervision can serve as a prior that regularizes unseen object regions during hand interactions. Leveraging this insight, MagicHOI incorporates an existing image-to-3D diffusion model into a hand-object reconstruction framework and refines hand poses using hand-object interaction constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art template-free hand-object reconstruction methods. We also show that image-to-3D diffusion priors effectively regularize unseen object regions, improving 3D hand-object reconstruction, and that the improved object geometries in turn yield more accurate hand poses. Our code will be made available for research purposes.