Oral
Oral 3B: Human Modeling
Kalakaua Ballroom
NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
Tianyi Wang · Shuaicheng Niu · Harry Cheng · Xiao Zhang · Yinglong Wang
As passive detection of high-quality Deepfake images suffers from performance bottlenecks due to the advancement of generative models, proactive perturbation offers a promising alternative that disables Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings that involve generative models during training. In this study, we analyze the essence of Deepfake face swapping, argue the necessity of protecting source identities rather than target images, and propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, and a Perturbation Block then generates identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are fused with the perturbations in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across the different identity extractors used in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with the correct source identities.
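As a rough illustration of the identity-cloaking objective described above, the sketch below combines per-extractor identity losses with a dynamic weighting step. The specific weighting rule (a softmax over detached per-extractor losses), the loss form, and the trade-off constants are assumptions for illustration, not the paper's formulation; the identity encoders are placeholders.

```python
# Hedged sketch of identity cloaking with an assumed dynamic loss weighting rule.
import torch
import torch.nn.functional as F

def identity_cloaking_loss(src_img, cloaked_img, id_extractors):
    """Encourage each identity extractor to see a different identity in the
    cloaked image while keeping the perturbation visually small."""
    id_losses = []
    for enc in id_extractors:                      # placeholder face-recognition encoders
        e_src = F.normalize(enc(src_img), dim=-1)
        e_clk = F.normalize(enc(cloaked_img), dim=-1)
        # push the cloaked identity away from the source identity (cosine similarity)
        id_losses.append(1.0 + (e_src * e_clk).sum(dim=-1).mean())
    id_losses = torch.stack(id_losses)

    # dynamic weighting (assumed): give more weight to extractors that are
    # currently harder to fool, so no single identity loss dominates training
    weights = torch.softmax(id_losses.detach(), dim=0)
    loss_id = (weights * id_losses).sum()

    loss_vis = F.mse_loss(cloaked_img, src_img)    # visual fidelity term
    return loss_id + 10.0 * loss_vis               # trade-off weight is illustrative
```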
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Korrawe Karunratanakul · Pu Wang · Hongfei Xue · Chen Chen · chuan guo · Junli Cao · Jian Ren · Sergey Tulyakov
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to overcome the non-differentiable token sampling process encountered by both the Logits Regularizer and Logit Optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77\%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualizations can be found at \url{https://anonymous-ai-agent.github.io/CAM}
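A minimal sketch of the Differentiable Expectation Sampling idea named in the abstract: rather than sampling a discrete token (which blocks gradients), take the expectation of the codebook embeddings under the predicted categorical distribution. The tensor shapes and the optional straight-through variant are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch of expectation-based "soft" token sampling for masked motion models.
import torch
import torch.nn.functional as F

def des(logits, codebook, hard=False):
    """logits: (B, T, K) token logits; codebook: (K, D) motion-token embeddings."""
    probs = F.softmax(logits, dim=-1)              # categorical distribution over K tokens
    expected = probs @ codebook                    # (B, T, D): differentiable expected embedding
    if hard:
        # optional straight-through variant: forward with the argmax embedding,
        # backward through the expectation so gradients still reach the logits
        idx = probs.argmax(dim=-1)
        hard_emb = codebook[idx]
        expected = hard_emb + (expected - expected.detach())
    return expected
```

The resulting soft embedding can be decoded to joint positions and compared against the controlled joints with a differentiable loss, which is what allows logits to be regularized at training time and optimized at inference time.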
HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
Byungjun Kim · Shunsuke Saito · Giljoo Nam · Tomas Simon · Jason Saragih · Hanbyul Joo · Junxuan Li
We present a universal prior model for 3D head avatars with hair compositionality. Existing approaches to building a generalizable prior for 3D head avatars often model face and hair in a monolithic manner, ignoring the inherent compositionality of the human head and hair. It is especially challenging for a monolithic model to self-discover the compositionality of face and hair when the dataset is not large enough. Moreover, extending a monolithic model to applications such as swapping faces or hairstyles in 3D is not straightforward. Our prior model explicitly accounts for the compositionality of face and hair, learning their priors separately. To learn disentangled latent spaces for the face and hair of 3D head avatars, we propose a synthetic hairless data creation pipeline that dehairs the studio-captured dataset using hairless geometry and texture estimated from a diffusion prior. Using a paired dataset of hair and hairless captures, disentangled prior models for face and hair can be trained by leveraging compositionality as an inductive bias to achieve disentanglement. Our model's inherent compositionality enables seamless transfer of face and hair components between avatars while maintaining the subject's identity. Furthermore, we demonstrate that our model can be finetuned with a monocular capture to create hair-compositional 3D head avatars for unseen subjects, highlighting the practical applicability of our prior model in real-world scenarios.
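As a rough illustration of the compositional structure described above, the sketch below keeps separate face and hair priors whose outputs are composed into one avatar, so hairstyles can be swapped by recombining latent codes. The module classes, tensor layout, and composition operator are placeholders, not the paper's architecture.

```python
# Hedged sketch of a face/hair compositional prior with swappable latent codes.
import torch
import torch.nn as nn

class CompositionalHeadPrior(nn.Module):
    def __init__(self, face_prior: nn.Module, hair_prior: nn.Module):
        super().__init__()
        self.face_prior = face_prior   # decodes z_face -> face primitives (placeholder)
        self.hair_prior = hair_prior   # decodes z_hair -> hair primitives (placeholder)

    def forward(self, z_face, z_hair):
        face = self.face_prior(z_face)
        hair = self.hair_prior(z_hair)
        # compose the two components into a single avatar representation;
        # here simply concatenated as one set of primitives for illustration
        return torch.cat([face, hair], dim=1)

# Hairstyle transfer then amounts to pairing one subject's z_face with
# another subject's z_hair: avatar = prior(z_face_subject_A, z_hair_subject_B)
```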
Understanding Co-speech Gestures in-the-wild
Sindhu Hegde · K R Prajwal · Taein Kwon · Andrew Zisserman
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal representation of speech, text, and gesture video to solve these tasks. By leveraging a combination of a global phrase contrastive loss and a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. All code, models, and data annotations will be released to support future research.
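For concreteness, a symmetric InfoNCE-style contrastive loss between paired gesture-video and phrase embeddings is one standard way to realize the global phrase contrastive objective named above. The exact formulation, batching, and temperature used by the paper are not specified here; this is a generic sketch.

```python
# Hedged sketch of a global phrase contrastive loss between video and phrase embeddings.
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(video_emb, phrase_emb, temperature=0.07):
    """video_emb, phrase_emb: (B, D) embeddings of temporally paired clips and phrases."""
    v = F.normalize(video_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    logits = v @ p.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # matching clip/phrase pairs lie on the diagonal; all other pairs act as negatives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```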
DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
Junzhe Lu · Jing Lin · Hongkun Dou · Ailing Zeng · Yue Deng · Xian Liu · Zhongang Cai · Lei Yang · YULUN ZHANG · Haoqian Wang · Ziwei Liu
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
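As a rough illustration of truncated timestep scheduling, the sketch below restricts sampling to an early (low-noise) portion of the diffusion chain rather than traversing all training steps. The truncation ratio, step count, and spacing are illustrative assumptions, not the paper's exact schedule.

```python
# Hedged sketch of a truncated timestep schedule for diffusion-based pose priors.
import torch

def truncated_timesteps(num_train_steps=1000, num_sample_steps=10, truncation=0.1):
    """Return a descending list of timesteps confined to the first
    `truncation` fraction of the noise schedule."""
    t_max = int(num_train_steps * truncation)              # e.g. only t in [0, 100)
    ts = torch.linspace(t_max - 1, 0, num_sample_steps)    # evenly spaced, high -> low
    return ts.round().long().tolist()

# e.g. truncated_timesteps() -> [99, 88, 77, ..., 0]; such a schedule could drive
# variational diffusion sampling when solving a pose-completion inverse problem.
```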
Teeth Reconstruction and Performance Capture Using a Phone Camera
Weixi Zheng · Jingwang Ling · Zhibo Wang · Quan Wang · Feng Xu
We present the first method for personalized dental shape reconstruction and teeth-inclusive facial performance capture using only a single phone camera. Our approach democratizes high-quality facial avatars through a non-invasive, low-cost setup by addressing the ill-posed monocular capture problem with an analysis-by-synthesis approach. We introduce a representation adaptation technique that maintains both mesh and SDF representations of teeth, enabling efficient differentiable rendering while preventing teeth-lip interpenetration. To overcome alignment challenges with similar-appearing dental components, we leverage foundation models for semantic teeth segmentation and design specialized optimization objectives. Our method addresses the challenging occlusions of teeth during facial performance through optimization strategies that leverage facial structural priors, while our semantic mask rendering loss with optimal transport-based matching ensures convergence despite significant variations in initial positioning. Code will be released.
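To illustrate how a signed-distance representation of the teeth could discourage teeth-lip interpenetration, the sketch below penalizes lip vertices that fall inside the teeth surface (negative signed distance). The SDF query interface, coordinate frame, and margin are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of an SDF-based teeth-lip interpenetration penalty.
import torch

def interpenetration_loss(lip_vertices, teeth_sdf, margin=0.0):
    """lip_vertices: (N, 3) lip-mesh vertex positions in the teeth SDF's frame;
    teeth_sdf: callable mapping (N, 3) points to (N,) signed distances
    (negative inside the teeth)."""
    d = teeth_sdf(lip_vertices)
    # penalize only penetrating vertices, i.e. those with d < margin
    return torch.relu(margin - d).mean()
```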