

Poster Session

Poster Session 3 & Exhibit Hall

Exhibit Hall I
Wed 22 Oct 1:45 p.m. PDT — 3:45 p.m. PDT

Wed 22 Oct. 13:45 - 15:45 PDT

#451
Highlight
AnimalClue: Recognizing Animals by their Traces

Risa Shinoda · Nakamasa Inoue · Iro Laina · Christian Rupprecht · Hirokatsu Kataoka

Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance for wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. We will make the dataset publicly available for research purposes.

Wed 22 Oct. 14:15 - 16:15 PDT

#1
Error Recognition in Procedural Videos using Generalized Task Graph

Shih-Po Lee · Ehsan Elhamifar

Understanding user actions and their possible mistakes is essential for the successful operation of task assistants. In this paper, we develop a unified framework for joint temporal action segmentation and error recognition (recognizing when and which type of error happens) in procedural task videos. We propose a Generalized Task Graph (GTG) whose nodes encode correct steps and background (task-irrelevant actions). We then develop a GTG-Video Alignment algorithm (GTG2Vid) to jointly segment videos into actions and detect frames containing errors. Given that it is infeasible to gather many videos and their annotations for different types of errors, we study a framework that only requires normal (error-free) videos during training. More specifically, we leverage large language models (LLMs) to obtain error descriptions and subsequently use video-language models (VLMs) to generate visually-aligned textual features, which we use for error recognition. We then propose an Error Recognition Module (ERM) to recognize the error frames predicted by GTG2Vid using the generated error features. Through extensive experiments on two egocentric datasets, EgoPER and CaptainCook4D, we show that our framework outperforms other baselines on action segmentation, error detection, and recognition.

Wed 22 Oct. 14:15 - 16:15 PDT

#2
MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

Jiahui Lei · Kyle Genova · George Kopanas · Noah Snavely · Leonidas Guibas

This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap in 3D, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.

Wed 22 Oct. 14:15 - 16:15 PDT

#3
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

Nan Chen · Mengqi Huang · Yihao Meng · Zhendong Mao

Animation colorization is a crucial part of real-world animation industry production. Colorizing long animations incurs high labor costs. Therefore, automated long animation colorization based on video generation models has significant research value. Existing studies are limited to short-term colorization and adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for the open-domain animation colorization task.

Wed 22 Oct. 14:15 - 16:15 PDT

#4
VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition Dataset

Minsoo Kim · Min-Cheol Sagong · Gi Pyo Nam · Junghyun Cho · Ig-Jae Kim

Deep learning-based face recognition continues to face challenges due to its reliance on huge datasets obtained from web crawling, which can be costly to gather and raise significant real-world privacy concerns. To address this issue, we propose VIGFace, a novel framework capable of generating synthetic facial images. Our idea originates from pre-assigning virtual identities in the feature space. Initially, we train the face recognition model using a real face dataset and create a feature space for both real and virtual identities, where virtual prototypes are orthogonal to other prototypes. Subsequently, we train the diffusion model based on the established feature space, enabling it to generate authentic human face images from real prototypes and synthesize virtual face images from virtual prototypes. Our proposed framework provides two significant benefits. Firstly, it shows clear separability between existing individuals and virtual face images, allowing one to create synthetic images with confidence and without concerns about privacy and portrait rights. Secondly, it ensures improved performance through data augmentation by incorporating real existing images. Extensive experiments demonstrate the superiority of our virtual face dataset and framework, outperforming the previous state-of-the-art on various face recognition benchmarks.

Wed 22 Oct. 14:15 - 16:15 PDT

#5
COVTrack: Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion

Zekun Qian · Ruize Han · Zhixiang Wang · Junhui Hou · Wei Feng

Open-Vocabulary Multi-Object Tracking (OVMOT) aims to detect and track diverse object categories in videos, including both seen (base) and unseen (novel) categories. Current methods rely on appearance features from generated image pairs or utilize the discontinuous annotations of the video dataset (TAO) for training, primarily due to the lack of available continuously annotated video datasets for OVMOT. This limitation affects their effectiveness, since continuous target trajectories are necessary for robust tracker learning. In this work, we propose the C-TAO dataset, which provides a continuous version of TAO, thereby constructing the first continuously annotated training dataset for OVMOT. This addresses the previous limitations in training data availability. Additionally, we introduce COVTrack, a unified framework that effectively integrates motion and semantic features with appearance features, in which the multi-cue feature aggregation strategy dynamically aggregates and balances these features, based on the confidence estimation from both intra-frame and inter-frame contexts. Our proposed framework significantly improves OVMOT performance, establishing COVTrack as a state-of-the-art solution on OVMOT benchmarks.

Wed 22 Oct. 14:15 - 16:15 PDT

#6
CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

Xiangyang Luo · Ye Zhu · Yunfei Liu · Lijian Lin · Cong Wan · Zijian Cai · Yu Li · Shao-Lun Huang

Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped identity is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation.

Wed 22 Oct. 14:15 - 16:15 PDT

#7
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Yiran Qin · Li Kang · Xiufeng Song · Zhenfei Yin · Xiaohong Liu · Xihui Liu · Ruimao Zhang · LEI BAI

Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.

Wed 22 Oct. 14:15 - 16:15 PDT

#8
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao · Shunlin Lu · Huaijin Pi · Ke Fan · Liang Pan · Yueer Zhou · Ziyong Feng · Xiaowei Zhou · Sida Peng · Jingbo Wang

This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. The code will be released for reproducibility.

Wed 22 Oct. 14:15 - 16:15 PDT

#9
RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text

Jiaben Chen · Xin Yan · Yihang Chen · Siyuan Cen · Zixin Wang · Qinwei Ma · Haoyu Zhen · Kaizhi Qian · Lie Lu · Chuang Gan

In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. We encourage readers to watch the supplementary video with audio enabled to fully experience the qualitative results.
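
For readers unfamiliar with the modality-unification step described above, the snippet below is a minimal PyTorch sketch of vector-quantizing continuous motion features into discrete tokens. The class name, dimensions, and the straight-through estimator are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionVectorQuantizer(nn.Module):
    """Toy VQ layer: maps continuous motion features to discrete token ids."""
    def __init__(self, codebook_size=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                       # z: (batch, frames, dim)
        flat = z.reshape(-1, z.shape[-1])       # (batch*frames, dim)
        # squared L2 distance from each feature to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        ids = d.argmin(dim=1)                   # discrete motion tokens
        z_q = self.codebook(ids).view_as(z)
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, ids.view(z.shape[:-1])
```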

Wed 22 Oct. 14:15 - 16:15 PDT

#10
Highlight
GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

Wentao Hu · Shunkai Li · Ziqiao Peng · Haoxian Zhang · Fan Shi · Xiaoqiang Liu · Pengfei Wan · Di ZHANG · Hui Tian

Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.

Wed 22 Oct. 14:15 - 16:15 PDT

#11
RoboPearls: Editable Video Simulation for Robot Manipulation

Tao Tang · Likui Zhang · Youpeng Wen · Kaidong Zhang · Jia-Wang Bian · xia zhou · Tianyi Yan · Kun Zhan · Peng Jia · Hefeng Wu · Liang Lin · Xiaodan Liang

The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues and close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating satisfactory simulation performance.

Wed 22 Oct. 14:15 - 16:15 PDT

#12
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Xiaomeng Chu · Jiajun Deng · Guoliang You · Wei Liu · Xingchen Li · Jianmin Ji · Yanyong Zhang

Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#13
Highlight
Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset

Ruofei WANG · Peiqi Duan · Boxin Shi · Renjie Wan

With more event datasets being released online, safeguarding event datasets against unauthorized usage has become a serious concern for data owners. Unlearnable Examples are proposed to prevent the unauthorized exploitation of image datasets. However, it's unclear how to create unlearnable asynchronous event streams to prevent event misuse. In this work, we propose the first unlearnable event stream generation method to prevent unauthorized training from event datasets. A new form of asynchronous event error-minimizing noise is proposed to perturb event streams, tricking the unauthorized model into learning embedded noise instead of realistic features. To be compatible with sparse event data, a projection strategy is presented to sparsify the noise and render our unlearnable event streams (UEvs). Extensive experiments demonstrate that our method effectively protects event data from unauthorized exploitation, while preserving their utility for legitimate use. We hope our UEvs contribute to the advancement of secure and trustworthy event dataset sharing.
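
As background, error-minimizing ("unlearnable") noise is typically produced by a min-min optimization: the perturbation is trained to drive a surrogate model's loss down, so an unauthorized trainer finds little signal left to learn. Below is a minimal PyTorch sketch of that generic idea; the event-specific sparsifying projection is only a hypothetical placeholder, not the paper's actual strategy.

```python
import torch
import torch.nn.functional as F

def error_minimizing_noise(model, x, y, eps=8 / 255, steps=20, lr=0.01):
    """Min-min noise: perturb inputs so a surrogate model's loss is *minimized*,
    leaving little useful signal for an unauthorized trainer to learn from."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()     # descend the loss, not ascend
            delta.clamp_(-eps, eps)             # keep the noise bounded
        delta.grad.zero_()
    return delta.detach()

def sparsify(delta, keep_ratio=0.1):
    """Hypothetical projection: keep only the largest-magnitude entries so the
    noise stays compatible with sparse, asynchronous event data."""
    k = max(1, int(keep_ratio * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values.min()
    return delta * (delta.abs() >= thresh)
```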

Wed 22 Oct. 14:15 - 16:15 PDT

#14
Exploiting Diffusion Prior for Task-driven Image Restoration

Jaeha Kim · Junghun Oh · Kyoung Mu Lee

Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. The goal of TDIR is to improve both visual quality and task performance. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This leads us to leverage the diffusion prior, one of the most powerful image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with state-of-the-art TDIR methods. To address this, we propose EDTR, the first TDIR method that incorporates the diffusion prior in ways that harness its strength to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pre-restored LQ images with mild noise added. Moreover, we suggest one-step denoising to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes the diffusion prior to restore task-relevant details, significantly enhancing task performance and visual quality across diverse tasks with complex degradations.

Wed 22 Oct. 14:15 - 16:15 PDT

#15
Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition

Yandan Wang · Chenqi Guo · Yinglong Ma · Jiangyan Chen · Yuan Gao · Weiming Dong

Skeleton-based action recognition faces class imbalance and insufficient labeling problems in real-world applications. Existing methods typically address these issues separately, lacking a unified framework that can effectively handle both issues simultaneously while considering their inherent relationships. Our theoretical analysis reveals two fundamental connections between these problems. First, class imbalance systematically shifts the eigenvalue spectrum of normalized affinity matrices, compromising both convergence and accuracy of label propagation. Second, boundary samples are critical for model training under imbalanced conditions but are often mistakenly excluded by conventional reliability metrics, which focus on relative class differences rather than holistic connectivity patterns. Built upon these theoretical findings, we propose SpeLER (Spectral-balanced Label Propagation with Energy-based Tightened Reliability), which introduces a spectral balancing technique that explicitly counteracts spectral shifts by incorporating class distribution information. Meanwhile, a propagation energy-based tightened reliability measure is proposed to better preserve crucial boundary samples by evaluating holistic connectivity patterns. Extensive experiments on six public datasets demonstrate that SpeLER consistently outperforms state-of-the-art methods, validating both our theoretical findings and practical effectiveness.
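
For context, the sketch below shows a generic label-propagation baseline on a normalized affinity matrix, with a simple inverse-class-frequency reweighting of the seed labels as a stand-in for the class-distribution-aware spectral balancing described above; it is not the SpeLER formulation itself.

```python
import numpy as np

def balanced_label_propagation(W, Y, class_counts=None, alpha=0.99, iters=50):
    """Generic label propagation on an affinity matrix W (n x n) with partial
    labels Y (n x c, zero rows for unlabeled samples). If class_counts is given,
    seed labels are reweighted inversely to class frequency -- a simple stand-in
    for class-distribution-aware balancing, not the SpeLER formulation."""
    if class_counts is not None:
        Y = Y / np.asarray(class_counts, dtype=float)   # down-weight head classes
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                     # symmetric normalization
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y             # propagate, anchor to seeds
    return F.argmax(axis=1)                             # predicted label per sample
```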

Wed 22 Oct. 14:15 - 16:15 PDT

#16
Highlight
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

Kaiyang Ji · Ye Shi · Zichen Jin · Kangyi Chen · Lan Xu · Yuexin Ma · Jingyi Yu · Jingya Wang

Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.

Wed 22 Oct. 14:15 - 16:15 PDT

#17
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Xilin He · Cheng Luo · Xiaole Xian · Bing Li · Siyang Song · Muhammad Haris Khan · Weicheng Xie · Linlin Shen · Zongyuan Ge · Bernard Ghanem · Xiangyu Yue

Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Results validate the efficacy of our approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the original size.

Wed 22 Oct. 14:15 - 16:15 PDT

#18
ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling

Rolandos Alexandros Potamias · Stathis Galanakis · Jiankang Deng · Athanasios Papaioannou · Stefanos Zafeiriou

Over recent years, 3D morphable models (3DMMs) have emerged as a state-of-the-art methodology for modeling and generating expressive 3D avatars. However, given their reliance on a strict topology, along with their linear nature, they struggle to represent complex full-head shapes. Following the advent of deep implicit functions (DIFs), we propose imHead, a novel implicit 3DMM that not only models expressive 3D head avatars but also facilitates localized editing of the facial features. Previous methods directly divided the latent space into local components accompanied by an identity encoding to capture the global shape variations, leading to expensive latent sizes. In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. To train imHead, we curate a large-scale dataset of over 4,500 identities, taking a step towards large-scale 3D head modeling. Through a series of experiments, we demonstrate the expressive power of the proposed model to represent diverse identities and expressions, outperforming previous approaches. Additionally, the proposed approach provides an interpretable solution for 3D face manipulation, allowing the user to make localized edits. Models and data will be made publicly available for research purposes.

Wed 22 Oct. 14:15 - 16:15 PDT

#19
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Li Hu · wang yuan · Zhen Shen · Xin Gao · Dechao Meng · Li'an Zhuo · Peng Zhang · Bang Zhang · Liefeng Bo

Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region with the exclusion of characters and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.

Wed 22 Oct. 14:15 - 16:15 PDT

#20
PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation

Fei Xie · Zhongdao Wang · Weijia Zhang · Chao Ma

Mamba, an architecture with RNN-like sequence modeling of state space model (SSM), has demonstrated promising capabilities in long-range modeling with high efficiency. However, Mamba models struggle with structured 2D visual data using sequential computing, thereby lagging behind their attention-based counterparts. In this paper, we propose a Parallel Vision Mamba (PVMamba), a novel SSM architecture tailored for visual data. PVMamba encompasses two key designs: 1) Based on the sparsity and adjacency of visual signals, we parallelize the sequential computing through three core steps, termed Dynamic State Aggregation (DSA), i.e., parallelization, spatial alignment, and vectorized aggregation. DSA generates the hidden state in SSM by a feasible spatial aggregation, thereby overcoming the inherent sequential constraints. 2) Along with maintaining linear computational complexity, we apply a dynamic operator to learn the spatial samplings for each hidden state. To further boost the local modeling capability, we restrict the dynamic operator to the neighboring pixels in shallow layers. We also devise a layer multiplexing technique to stabilize the training and reduce the learning redundancy. PVMamba is a versatile backbone network with dynamic operators for various vision tasks, such as image classification and dense prediction. Extensive experiments show that PVMamba achieves state-of-the-art performance on a range of benchmarks. Our code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#21
FlowStyler: Artistic Video Stylization via Transformation Fields Transports

YuNing Gong · Jiaming Chen · Xiaohua Ren · Yuanjun Liao · Yanci Zhang

Contemporary video stylization approaches struggle to achieve artistic stylization while preserving temporal consistency. While generator-based methods produce visually striking stylized results, they suffer from flickering artifacts in dynamic motion scenarios and require prohibitive computational resources. Conversely, non-generative techniques frequently show either temporal inconsistency or inadequate style preservation. We address these limitations by adapting the physics-inspired transport principles from the Transport-based Neural Style Transfer (TNST) framework (originally developed for volumetric fluid stylization) to enforce inter-frame consistency in video stylization. Our framework employs two complementary transformation fields for artistic stylization: a geometric stylization velocity field governing deformation and an orthogonality-regularized color transfer field managing color adaptations. We further strengthen temporal consistency through two key enhancements to our field architecture: a momentum-preserving strategy mitigating vibration artifacts, and an occlusion-aware temporal lookup strategy addressing motion trailing artifacts. Extensive experiments demonstrate FlowStyler's superior performance across dual dimensions: compared to generator-based approaches, we achieve 4× lower short-term warping errors while maintaining comparable style fidelity; against non-generative methods, FlowStyler attains 22% higher style fidelity with slightly improved temporal stability.
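
The short-term warping error cited in the comparison is a standard temporal-consistency metric. A minimal sketch, assuming precomputed backward optical flow and an occlusion mask, might look like this:

```python
import torch
import torch.nn.functional as F

def short_term_warp_error(frame_t, frame_t1, flow_t1_to_t, occ_mask):
    """Warp stylized frame t to frame t+1 with backward optical flow and
    measure masked MSE in non-occluded regions (lower = more stable).
    frame_*: (B, 3, H, W); flow_t1_to_t: (B, 2, H, W); occ_mask: (B, 1, H, W)."""
    _, _, h, w = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(frame_t.device)  # (H, W, 2), x first
    coords = grid + flow_t1_to_t.permute(0, 2, 3, 1)                 # sampling positions
    # normalize to [-1, 1] as expected by grid_sample
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    warped = F.grid_sample(frame_t, coords, align_corners=True)
    diff = (warped - frame_t1) ** 2 * occ_mask
    return diff.sum() / occ_mask.sum().clamp(min=1)
```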

Wed 22 Oct. 14:15 - 16:15 PDT

#22
Controllable and Expressive One-Shot Video Head Swapping

Chaonan Ji · Jinwei Qi · Peng Zhang · Bang Zhang · Liefeng Bo

In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows tweaking the head's expressions and movements during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement, neglecting holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds; none of these methods allow users to modify the transplanted head’s expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.

Wed 22 Oct. 14:15 - 16:15 PDT

#23
Multi-modal Multi-platform Person Re-Identification: Benchmark and Method

Ruiyang Ha · Songyi Jiang · Bin Li · Bikang Pan · Yihang Zhu · Junjie Zhang · Xiatian Zhu · Shaogang Gong · Jingya Wang

Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Additionally, our dataset will be made publicly available to support further advancements.

Wed 22 Oct. 14:15 - 16:15 PDT

#24
HERO: Human Reaction Generation from Videos

Chengjun Yu · Wei Zhai · Yuhang Yang · Yang Cao · Zheng-Jun Zha

Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#25
Highlight
Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

Giacomo D'Amicantonio · Snehashis Majhi · Quan Kong · Lorenzo Garattoni · Gianpiero Francesca · Egor Bondarev · Francois Bremond

Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
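
One plausible reading of the temporal Gaussian guidance is that 1D Gaussians over the frame axis turn weak video-level labels into frame-level soft targets. The sketch below illustrates that idea only; the centers and sigmas are hypothetical inputs, and the authors' actual loss is not reproduced here.

```python
import torch

def temporal_gaussian_targets(num_frames, centers, sigmas):
    """Render 1D temporal Gaussians (centers/sigmas in frame units) into a
    per-frame soft anomaly target in [0, 1], taking the max over all Gaussians."""
    t = torch.arange(num_frames, dtype=torch.float32)           # (T,)
    c = torch.as_tensor(centers, dtype=torch.float32)[:, None]  # (K, 1)
    s = torch.as_tensor(sigmas, dtype=torch.float32)[:, None]   # (K, 1)
    g = torch.exp(-0.5 * ((t[None, :] - c) / s) ** 2)           # (K, T)
    return g.max(dim=0).values                                  # (T,)

# Example: a 64-frame clip with suspected anomalies around frames 20 and 45.
targets = temporal_gaussian_targets(64, centers=[20, 45], sigmas=[4.0, 6.0])
```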

Wed 22 Oct. 14:15 - 16:15 PDT

#26
What If: Understanding Motion Through Sparse Interactions

Stefan A. Baumann · Nick Stracke · Timy Phan · Björn Ommer

Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed "pokes". Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned on strongly out-of-distribution data, such as synthetic datasets, enabling significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, which further demonstrates the versatility of FPT.

Wed 22 Oct. 14:15 - 16:15 PDT

#27
PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

Tewodros W. Ayalew · Xiao Zhang · Kevin Y Wu · Tianchong Jiang · Michael Maire · Matthew Walter

We present PROGRESSOR, a novel framework that learns a task-agnostic reward function from videos, enabling policy training through goal-conditioned reinforcement learning (RL) without manual supervision. Underlying this reward is an estimate of the distribution over task progress as a function of the current, initial, and goal observations that is learned in a self-supervised fashion. Crucially, PROGRESSOR refines rewards adversarially during online RL training by pushing back high-variance predictions, to mitigate distribution shift inherent in non-expert observations. Utilizing this progress prediction as a dense reward together with an adversarial push-back, we show that PROGRESSOR enables robots to learn complex behaviors without any external supervision. Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data for generalization to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual reward for robotic learning. Our findings highlight the potential of PROGRESSOR for scalable robotic applications where direct action labels and task-specific rewards are not readily available.
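
To make the self-supervised progress idea concrete, the sketch below shows a toy regressor over (initial, current, goal) frame embeddings whose training target is simply the fraction of the trajectory elapsed; the distributional output and the adversarial push-back described above are omitted, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProgressEstimator(nn.Module):
    """Toy progress regressor over (initial, current, goal) frame embeddings."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone                        # any image encoder -> feat_dim
        self.head = nn.Sequential(
            nn.Linear(3 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, obs_init, obs_now, obs_goal):
        z = torch.cat([self.backbone(obs_init),
                       self.backbone(obs_now),
                       self.backbone(obs_goal)], dim=-1)
        return self.head(z).squeeze(-1)                 # predicted progress in [0, 1]

def self_supervised_target(frame_idx, num_frames):
    """Progress labels come for free from frame indices: fraction of task elapsed."""
    return frame_idx.float() / float(num_frames)
```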

Wed 22 Oct. 14:15 - 16:15 PDT

#28
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

Chirui CHANG · Jiahui Liu · Zhengzhe Liu · Xiaoyang Lyu · Yi-Hua Huang · Xin Tao · Pengfei Wan · Di ZHANG · Xiaojuan Qi

Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos’ ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Sora, MiniMax, and Kling) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies.
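
As a rough illustration of the kind of classifier at the heart of such an approach, the sketch below is a tiny 3D CNN over stacked per-frame cue maps (e.g., flow, depth, RGB) that outputs a real-vs-synthetic logit; the architecture and channel layout are assumptions, not L3DE's actual network.

```python
import torch
import torch.nn as nn

class RealVsSyntheticNet(nn.Module):
    """Tiny 3D CNN over stacked per-frame cue maps, shaped
    (batch, channels, frames, height, width); outputs a realness logit."""
    def __init__(self, in_channels=6):          # e.g. 2 flow + 1 depth + 3 RGB
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, 1)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1)).squeeze(-1)
```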

Wed 22 Oct. 14:15 - 16:15 PDT

#29
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Chaitanya Patel · Hiroki Nakamura · Yuta Kyuragi · Kazuki Kozuka · Juan Carlos Niebles · Ehsan Adeli

Egocentric human motion generation and forecasting with scene context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#30
DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models

Hyeonwoo Kim · Sangwon Baik · Hanbyul Joo

Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such an ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with a Low-Rank Adaptation (LoRA) module to fine-tune a pre-trained MDM to learn the HOI motion concepts from limited HOI motion samples, and (2) a motion diffusion model for 4D object poses conditioned on the produced human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create novel HOI motions, even for multiple HOI motion concepts, demonstrating the advantage of our pipeline with LoRA in integrating dynamic HOI concepts. Through extensive experiments, we demonstrate that DAViD outperforms baselines in synthesizing HOI motion.

Wed 22 Oct. 14:15 - 16:15 PDT

#31
DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Rongjia Zheng · Qing Zhang · Chengjiang Long · Wei-Shi Zheng

Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods. Our code and trained model will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#32
RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration

Longxin Kou · Fei Ni · Jianye HAO · Han Peilong · Jinyi Liu · Haiqin Cui · Rui Liu · YAN ZHENG

Recent advances in robotics have produced numerous valuable large-scale demonstration datasets, yet their potential remains underutilized due to annotation limitations. Current datasets often suffer from sparse temporal annotations and inconsistent labeling granularity, particularly for complex long-horizon demonstrations. Traditional manual annotation methods are expensive and poorly scalable, while existing automated methods struggle with temporal coherence and semantic richness across extended demonstrations. To this end, we propose RoboAnnotatorX, a reliable annotation tool that enhances a multimodal large language model to generate high-quality, context-rich annotations for complex long-horizon demonstrations. Specifically, we introduce a multi-scale token-efficient encoder that maintains computational efficiency while simultaneously capturing fine-grained visual details and preserving temporal information by jointly integrating scene-level anchoring, clip-level temporal dynamics, and video-level global modeling. We further construct a comprehensive dataset, RoboX-VQA, that synthesizes diverse QA pairs from both real-world and simulated data, bridging the significant domain gap in robotics demonstrations. Moreover, we leverage curriculum-inspired three-stage training to progressively develop capabilities from basic visual perception to sophisticated temporal reasoning. Extensive experiments demonstrate that RoboAnnotatorX significantly outperforms existing approaches in annotation quality and exhibits strong generalization across diverse robotic environments, helping unlock the full potential of existing robotic datasets.

Wed 22 Oct. 14:15 - 16:15 PDT

#33
FaceShield: Defending Facial Image against Deepfake Threats

Jaehwan Jeong · Sumin In · Sieun Kim · Shin yi · Jongheon Jeong · Sang Yoon · Jaewook Chung · Sangpil Kim

The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity is disregarded. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel defense strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates defenses on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting transferability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.
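
To illustrate how a low-pass filter can be folded into a protective perturbation (one ingredient mentioned above), here is a hedged PGD-style sketch in PyTorch; the loss function, step sizes, and the Gaussian-blurred update are illustrative assumptions rather than FaceShield's actual attack on DM attention.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=5, sigma=1.0, channels=3):
    """Depthwise Gaussian kernel of shape (channels, 1, ksize, ksize)."""
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    k2d = k2d / k2d.sum()
    return k2d.expand(channels, 1, ksize, ksize).contiguous()

def protective_perturbation(loss_fn, x, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style perturbation for RGB images x in [0, 1], with each update
    low-pass filtered (Gaussian blur) to keep the noise smooth and less
    visible while remaining robust to JPEG-like degradation."""
    kernel = gaussian_kernel(channels=x.shape[1]).to(x.device)
    pad = kernel.shape[-1] // 2
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x + delta)              # e.g. disrupt a face-feature extractor
        loss.backward()
        with torch.no_grad():
            step = alpha * delta.grad.sign()
            step = F.conv2d(step, kernel, padding=pad, groups=x.shape[1])
            delta += step
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)
```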

Wed 22 Oct. 14:15 - 16:15 PDT

#34
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

An Lun Liu · Yu-Wei Chao · Yi-Ting Chen

In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method are task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance.

Wed 22 Oct. 14:15 - 16:15 PDT

#35
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

shanlin sun · Yifan Wang · Hanwen Zhang · Yifeng Xiong · Qin Ren · Ruogu Fang · Xiaohui Xie · Chenyu You

While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.

Wed 22 Oct. 14:15 - 16:15 PDT

#36
Expressive Talking Human from Single-Image with Imperfect Priors

Jun Xiang · Yudong Guo · Leipeng Hu · Boyang Guo · Yancheng Yuan · Juyong Zhang

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-frames, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

Wed 22 Oct. 14:15 - 16:15 PDT

#37
InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians

Kefan Chen · Sergiu Oprea · Justin Theiss · Sreyas Mohan · Srinath Sridhar · Aayush Prakash

With the rising interest from the community in digital avatars, coupled with the importance of expressions and gestures in communication, modeling natural avatar behavior remains an important challenge across many industries such as teleconferencing, gaming, and AR/VR. Human hands are the primary tool for interacting with the environment and essential for realistic human behavior modeling, yet existing 3D hand and head avatar models often overlook the crucial aspect of hand-body interactions, such as between hand and face. We present InteractAvatar, the first model to faithfully capture the photorealistic appearance of dynamic hand and non-rigid hand-face interactions. Our novel Dynamic Gaussian Hand model, combining a template model and 3D Gaussian Splatting with a dynamic refinement module, captures pose-dependent changes, e.g., the fine wrinkles and complex shadows that occur during articulation. Importantly, our hand-face interaction module models the subtle geometry and appearance dynamics that underlie common gestures. Through experiments on novel view synthesis, self-reenactment, and cross-identity reenactment, we demonstrate that InteractAvatar can reconstruct hand and hand-face interactions from monocular or multiview videos with high-fidelity details and can be animated with novel poses.

Wed 22 Oct. 14:15 - 16:15 PDT

#38
Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

Zefeng Qian · Xincheng Yao · Yifei Huang · Chong-Yang Zhang · Jiangyong Ying · Hong Sun

Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, object interactions, and motion dynamics that occur during different phases of an action are critical inherent knowledge of actions that cannot be fully exploited by relying solely on the text within action labels. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework for FSAR that goes beyond label semantics by modeling actions at a finer granularity. LGA anatomizes both the textual and visual modalities, effectively exploring rich spatiotemporal cues across different temporal phases of actions. For text, we prompt an off-the-shelf Large Language Model to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of an action (subject, motion, object). For videos, we design a Visual Anatomy Module to segment actions into atomic video phases, capturing the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.

Wed 22 Oct. 14:15 - 16:15 PDT

#39
Highlight
DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance

Huu Phu Do · Yu-Wei Chen · Yi-Cheng Liao · Chi-Wei Hsiao · Han-Yang Wang · Wei-Chen Chiu · Ching-Chun Huang

Blind Face Restoration aims to recover high-fidelity, detail-rich facial images from unknown degraded inputs, presenting significant challenges in preserving both identity and detail. Pre-trained diffusion models have been increasingly used as image priors to generate fine details. Still, existing methods often use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation. This limitation and potentially imperfect degradation kernel estimation frequently lead to under- or over-diffusion, resulting in an imbalance between fidelity and quality. We propose DynFaceRestore, a novel blind face restoration approach that learns to map any blindly degraded input to Gaussian blurry images. By leveraging these blurry images and their respective Gaussian kernels, we dynamically select the starting timesteps for each blurry image and apply closed-form guidance during the diffusion sampling process to maintain fidelity. Additionally, we introduce a dynamic guidance scaling adjuster that modulates the guidance strength across local regions, enhancing detail generation in complex areas while preserving structural fidelity in contours. This strategy effectively balances the trade-off between fidelity and quality. DynFaceRestore achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating robustness and effectiveness in blind face restoration.

Wed 22 Oct. 14:15 - 16:15 PDT

#40
Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

Xudong Li · Zihao Huang · Yan Zhang · Yunhang Shen · Ke Li · Xiawu Zheng · Liujuan Cao · Rongrong Ji

Image Quality Assessment (IQA) remains an unresolved challenge in the field of computer vision, due to complex distortion conditions, diverse image content, and limited data availability. Existing Blind IQA (BIQA) methods heavily rely on extensive human annotations to train models, which is both labor-intensive and costly given the demanding nature of creating IQA datasets. To mitigate the dependence on labeled samples, this paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA). This framework rapidly adapts the powerful visual-language pre-trained model CLIP to downstream IQA tasks, significantly improving accuracy in scenarios with limited data. Specifically, GRMP-IQA comprises two key modules: the Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization. The Meta-Prompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. The Quality-Aware Gradient Regularization, in turn, adjusts the update gradients during fine-tuning, focusing the model's attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on five standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods under the limited-data setting, achieving SRCC values of 0.836 (vs. 0.760 on LIVEC) and 0.853 (vs. 0.812 on KonIQ). Notably, utilizing just 20% of the training data, GRMP-IQA outperforms most existing fully supervised BIQA methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#41
Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

Fangyikang Wang · Hubery Yin · Lei Qian · Yinan Li · SHAOBIN ZHUANG · Huminhao Zhu · Yilin Zhang · Yanlong Tang · Chao Zhang · Hanbin Zhao · Hui Qian · Chen Li

The emerging diffusion models (DMs) have demonstrated a remarkable capability of generating images by learning the noised score function of the data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin dynamics is common practice in Markov chain Monte Carlo (MCMC), naive attempts to utilize Hessian geometry in high-dimensional DMs incur quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) a low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; (2) a damping mechanism to stabilize the approximated Hessian. This LML-approximated Hessian geometry enables the diffusion sampler to take more accurate steps and improve image generation quality. We further conduct theoretical analysis to substantiate the approximation error bound of the low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.
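
The damping idea can be illustrated on a toy target where the Hessian is known exactly. The sketch below is not the paper's LML sampler; it only shows Levenberg-Marquardt-style damping applied to a preconditioned Langevin step on a 2D anisotropic Gaussian, with the step size, damping value, and target covariance chosen arbitrarily for illustration.

```python
# Toy illustration (not the paper's LML sampler): Levenberg-Marquardt-style
# damping applied to preconditioned Langevin dynamics on a 2D anisotropic
# Gaussian. The exact Hessian of log p(x) is -inv(Sigma); adding lambda*I
# keeps the preconditioner well-conditioned, mirroring the stabilization idea.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5], [1.5, 0.5]])      # anisotropic target covariance
Sigma_inv = np.linalg.inv(Sigma)

def score(x):
    """Score of N(0, Sigma): grad log p(x) = -Sigma^{-1} x."""
    return -Sigma_inv @ x

def lm_langevin_step(x, step=0.1, damping=0.1):
    """One damped, Hessian-preconditioned Langevin step."""
    H = Sigma_inv                                # negative Hessian of log p
    P = np.linalg.inv(H + damping * np.eye(2))   # damped preconditioner
    noise = rng.multivariate_normal(np.zeros(2), 2.0 * step * P)
    return x + step * (P @ score(x)) + noise

x = np.array([6.0, -4.0])
for _ in range(2000):
    x = lm_langevin_step(x)
print("final sample:", x)
```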

Wed 22 Oct. 14:15 - 16:15 PDT

#42

In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present pseudo-batch inversion, an initialization technique that incorporates informative latents from the measurement. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280×720) in under 6 seconds per frame on a single NVIDIA 4090 GPU. Project page: https://vision-xl.github.io/.

Wed 22 Oct. 14:15 - 16:15 PDT

#43
Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Kaname Yokoyama · Chihiro Nakatani · Norimichi Ukita

This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. In our method, such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection. For further improvement, the group structure should be consistent over time. While previous methods are stabilized by assuming that groups do not change within a video, our method detects dynamically changing groups through global optimization over a graph built from all frames' groupness probabilities, estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://anonymous.4open.science/r/ICCV2025_DVT-D1A5

Wed 22 Oct. 14:15 - 16:15 PDT

#44
When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

Hanqing Liu · Shouwei Ruan · Yao Huang · Shiji Zhao · Xingxing Wei

Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose $\textbf{I}$llumination $\textbf{T}$ransformation $\textbf{A}$ttack ($\textbf{ITA}$), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions, and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we innovatively decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations than previous methods could capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we can precisely render such light interactions in the original scenes, finally meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with the original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA can significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMs' critical illumination vulnerabilities.

Wed 22 Oct. 14:15 - 16:15 PDT

#45
Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising

Sébastien Herbreteau · Michael Unser

Supervised deep learning has become the method of choice for image denoising. It involves the training of neural networks on large datasets composed of pairs of noisy and clean images. However, the necessity of training data that are specific to the targeted application constrains the widespread use of denoising networks. Recently, several approaches have been developed to overcome this difficulty, either by artificially generating realistic clean/noisy image pairs or by training exclusively on noisy images. In this paper, we show that, contrary to popular belief, denoising networks specialized in the removal of Gaussian noise can be efficiently leveraged for real-world image denoising, even without additional training. For this to happen, an appropriate variance-stabilizing transform (VST) has to be applied beforehand. We propose an algorithm termed Noise2VST for learning such a model-free VST. Our approach requires only the input noisy image and an off-the-shelf Gaussian denoiser. We demonstrate through extensive experiments the efficiency and superiority of Noise2VST in comparison to existing methods trained in the absence of specific clean/noisy pairs.
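
To make the VST idea concrete, the sketch below uses the classical Anscombe transform as a stand-in for the learned, model-free VST of Noise2VST, and a simple Gaussian filter as the off-the-shelf Gaussian denoiser; the image, noise model, and parameters are toy assumptions.

```python
# Minimal sketch of the VST-then-Gaussian-denoise idea (not the learned,
# model-free VST from the paper): the Anscombe transform maps Poisson noise
# to approximately unit-variance Gaussian noise, so a generic Gaussian
# denoiser can be applied, followed by the inverse transform.
import numpy as np
from scipy.ndimage import gaussian_filter   # stand-in Gaussian denoiser

def anscombe(x):
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    # Simple algebraic inverse; the paper's setting instead learns the VST.
    return (y / 2.0) ** 2 - 3.0 / 8.0

rng = np.random.default_rng(0)
clean = 20.0 * (np.indices((64, 64)).sum(axis=0) % 16 < 8)   # toy stripe image
noisy = rng.poisson(clean).astype(np.float64)                # signal-dependent noise

stabilized = anscombe(noisy)                   # noise is now ~unit-variance Gaussian
denoised = gaussian_filter(stabilized, sigma=1.0)
restored = inverse_anscombe(denoised)

print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("restored MSE:", np.mean((restored - clean) ** 2))
```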

Wed 22 Oct. 14:15 - 16:15 PDT

#46
Reverse Convolution and Its Applications to Image Restoration

Xuhong Huang · Shiqi Liu · Kai Zhang · Ying Tai · Jian Yang · Hui Zeng · Lei Zhang

Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution, a.k.a. deconvolution, does not truly invert convolution due to their inherent differences in formulation. To date, there is no reverse convolution operator that has been developed as a basic component in deep neural networks. In this paper, we propose a novel depthwise reverse convolution operator as a first-step exploration to effectively reverse the depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this reverse convolution operator, we integrate it with layer normalization, 1$\times$1 convolution, and GELU activation to form a reverse convolution block, similar to a Transformer block. The proposed reverse convolution block can easily replace its convolution and transposed convolution counterparts in existing architectures, leading to the development of ConverseNet. By incorporating it into classical models like DnCNN, SRResNet and USRNet, we train ConverseNet to solve three typical image restoration tasks including Gaussian denoising, super-resolution and deblurring. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as both a fundamental building block and a novel deconvolution operator for inverse problems. We hope our work could pave the way for developing new operators in deep model design and applications.
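
For intuition about what "reversing" a convolution means in a least-squares sense, the sketch below shows the classical closed-form Tikhonov-regularized inverse of a circular, single-channel convolution in the Fourier domain. It is a hand-written illustration of the kind of formulation the abstract alludes to, not the paper's learnable depthwise operator; the kernel, image, and regularization weight are arbitrary.

```python
# Regularized least-squares view of reversing a circular convolution:
# the minimizer of ||k (*) x - y||^2 + lam ||x||^2 has the closed form
# X = conj(K) Y / (|K|^2 + lam) in the Fourier domain.
import numpy as np

def psf_to_otf(k, shape):
    """Zero-pad a small kernel to `shape` and circularly shift it so the
    kernel center sits at pixel (0, 0), then take its 2D FFT."""
    kh, kw = k.shape
    k_pad = np.zeros(shape)
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(k_pad)

def circ_conv(x, k):
    """Circular 2D convolution implemented in the Fourier domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * psf_to_otf(k, x.shape)))

def reverse_conv(y, k, lam=1e-3):
    """Closed-form Tikhonov-regularized inverse of circ_conv."""
    K = psf_to_otf(k, y.shape)
    X = np.conj(K) * np.fft.fft2(y) / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

rng = np.random.default_rng(0)
x = rng.random((64, 64))
k = np.ones((5, 5)) / 25.0           # box-blur kernel
y = circ_conv(x, k)                  # forward (one channel of a depthwise conv)
x_hat = reverse_conv(y, k)
print("mean squared reconstruction error:", np.mean((x - x_hat) ** 2))
```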

Wed 22 Oct. 14:15 - 16:15 PDT

#47
MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence

Liyuan Deng · Yunpeng Bai · Yongkang Dai · Xiaoshui Huang · Hongping Gan · Dongshuo Huang · Hao jiacheng · Yilei Shi

Parametric Computer-Aided Design (CAD) is crucial in industrial applications, yet existing approaches often struggle to generate long sequence parametric commands due to complex CAD models' geometric and topological constraints. To address this challenge, we propose MamTiff-CAD, a novel CAD parametric command sequences generation framework that leverages a Transformer-based diffusion model for multi-scale latent representations. Specifically, we design a novel autoencoder that integrates Mamba+ and Transformer, to transfer parameterized CAD sequences into latent representations. The Mamba+ block incorporates a forget gate mechanism to effectively capture long-range dependencies. The non-autoregressive Transformer decoder reconstructs the latent representations. A diffusion model based on multi-scale Transformer is then trained on these latent embeddings to learn the distribution of long sequence commands. In addition, we also construct a dataset that consists of long parametric sequences, which is up to 256 commands for a single CAD model. Experiments demonstrate that MamTiff-CAD achieves state-of-the-art performance on both reconstruction and generation tasks, confirming its effectiveness for long sequence (60-256) CAD model generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#48
Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

Md Ashiqur Rahman · Chiao-An Yang · Michael N Cheng · Lim Hao · Jeremiah Jiang · Teck-Yian Lim · Raymond A. Yeh

Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT.
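
A minimal sketch of the fixed-point flavor of such a canonicalizer is given below, under the assumption that canonicalization amounts to predicting a per-image zoom factor; the network, zoom parameterization, and iteration budget are illustrative choices, not the paper's latent DEC.

```python
# Toy fixed-point canonicalizer (not the paper's DEC): a small network
# predicts a residual log-scale from the image, and we iterate
# "zoom -> predict" to a fixed point before handing the image to a backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalePredictor(nn.Module):
    """Tiny network predicting a bounded residual log-scale per image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, x):
        return 0.1 * torch.tanh(self.net(x)).squeeze(-1)   # shape (B,)

def zoom(x, scale):
    """Zoom each image about its center; `scale` is a (B,) tensor."""
    theta = torch.zeros(x.shape[0], 2, 3, device=x.device)
    theta[:, 0, 0] = 1.0 / scale
    theta[:, 1, 1] = 1.0 / scale
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def canonicalize(x, predictor, iters=20, tol=1e-4):
    """Fixed-point iteration: s <- s + f(zoom(x, exp(s)))."""
    s = torch.zeros(x.shape[0], device=x.device)
    for _ in range(iters):
        s_new = s + predictor(zoom(x, torch.exp(s)))
        if torch.max(torch.abs(s_new - s)) < tol:
            return s_new
        s = s_new
    return s

x = torch.rand(2, 3, 64, 64)
print("fixed-point log-scales:", canonicalize(x, ScalePredictor()))
```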

Wed 22 Oct. 14:15 - 16:15 PDT

#49
HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation

Chenzhong Gao · Wei Li · Desheng Weng

An exploration of cross-arbitrary-modal image invariant feature extraction and matching is conducted, with a purely handcrafted full-chain algorithm, Homomorphism of Organized Major Orientation (HOMO), being proposed. Instead of using deep models to conduct data-driven black-box learning, we introduce a Major Orientation Map (MOM), effectively combating image modal differences. Considering rotation, scale, and texture diversities in cross-modal images, HOMO incorporates a novel, universally designed Generalized-Polar descriptor (GPolar) and a Multi-scale Strategy (MsS) to gain well-rounded capability. HOMO achieves the best overall feature-matching performance on several general cross-modal datasets in challenging comparisons against a set of state-of-the-art methods, including 7 traditional algorithms and 10 deep network models. A dataset named General Cross-modal Zone (GCZ) is also proposed and shows practical value.

Wed 22 Oct. 14:15 - 16:15 PDT

#50
DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

Xirui Hu · Jiahao Wang · Hao chen · Weizhan Zhang · Benqi Wang · yikun Li · Haishun Nan

Recent advancements in text-to-image generation have spurred interest in personalized human image generation, which aims to create novel images featuring specific human identities indicated by reference images. Although existing methods achieve high-fidelity identity preservation, they often struggle with limited multi-ID usability and inadequate facial editability. We present DynamicID, a tuning-free framework supported by a dual-stage training paradigm that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the original model when injecting ID features and achieves multi-ID personalization without requiring multi-ID samples during training; and 2) Identity-Motion Reconfigurator (IMR), which leverages contrastive learning to effectively disentangle and re-entangle facial motion and identity features, thereby enabling flexible facial editing. Additionally, we have developed the curated VariFace-10k facial dataset, comprising 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability.

Wed 22 Oct. 14:15 - 16:15 PDT

#51
FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework

Yiwen Zhao · Yang Wang · Liting Wen · Hengyuan Zhang · Xingqun Qi

Generating harmonic and diverse human motions from music signals, especially for multi-person group dance, is a practical yet challenging task in virtual avatar creation. Existing methods merely model group dance with a fixed number of dancers, lacking the flexibility to generate group movements for an arbitrary number of individuals. To fulfill this goal, we propose a novel unified framework capable of synthesizing free-number dancers harmonically aligned with given music, namely $\textbf{\textit{FreeDance}}$. Considering the plausibility of arbitrary dancer generation while preserving the diverse dynamics of multiple individuals, we build the framework upon collaborative masked token modeling in 2D discrete space. In particular, we devise a $\textbf{\textit{Cross-modality Residual Alignment Module (CRAM)}}$ to diversify the movement of each individual and intensify its alignment with music. CRAM captures the spatial motion deformation of each dancer using residual learning and integrates it with rhythmic representation into a joint embedding. We leverage this joint embedding to enhance cross-entity alignment while reinforcing the intrinsic connection between motion and music. Moreover, recognizing the requirement of interactive coordination among generated multi-dancer motions, we design a $\textbf{\textit{Temporal Interaction Module (TIM)}}$. Benefiting from masked 2D motion tokens, TIM effectively models the temporal correlation of each individual with neighboring dancers as interaction guidance to foster stronger inter-dancer dependencies. Extensive experiments demonstrate that our approach generates harmonic group dance with any number of individuals, outperforming state-of-the-art methods adapted from number-fixed counterparts.

Wed 22 Oct. 14:15 - 16:15 PDT

#52
SILO: Solving Inverse Problems with Latent Operators

Ron Raphaeli · Sean Man · Michael Elad

Plug-and-play methods for solving inverse problems have continuously improved over the years by incorporating more advanced image priors. Latent diffusion models are among the most powerful priors, making them a natural choice for solving inverse problems. However, existing approaches require multiple applications of an Autoencoder to transition between pixel and latent spaces during restoration, leading to high computational costs and degraded restoration quality. In this work, we introduce a new plug-and-play paradigm that operates entirely in the latent space of diffusion models. By emulating pixel-space degradations directly in the latent space through a short learning phase, we eliminate the need for the Autoencoder during restoration, enabling faster inference and improved restoration fidelity. We validate our method across various image restoration tasks and datasets, achieving significantly higher perceptual quality than previous methods while being $2.6{-}10{\times}$ faster in inference and $1.7{-}7{\times}$ faster when accounting for the learning phase of the latent operator.
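
A toy version of the latent-operator idea might look like the sketch below: a small network is fitted so that applying it to encoded images matches encoding the degraded images. The encoder, degradation, and training setup here are stand-ins, not SILO's actual components.

```python
# Toy "latent operator" sketch: learn A_lat such that A_lat(E(x)) ~= E(A(x)),
# so a restoration loop can stay in latent space instead of decoding to pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(               # toy stand-in for a VAE encoder
    nn.Conv2d(3, 8, 4, stride=4), nn.ReLU(), nn.Conv2d(8, 4, 3, padding=1),
)

def degrade(x):                        # pixel-space degradation A: 4x downsample
    return F.avg_pool2d(x, kernel_size=4)

latent_op = nn.Sequential(             # learnable latent-space surrogate of A
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 4, 3, padding=1),
    nn.AvgPool2d(4),
)

opt = torch.optim.Adam(latent_op.parameters(), lr=1e-3)
for step in range(200):                # short learning phase on random images
    x = torch.rand(8, 3, 64, 64)
    with torch.no_grad():
        z = encoder(x)                 # latent of the clean image
        z_deg = encoder(degrade(x))    # latent of the degraded image
    loss = F.mse_loss(latent_op(z), z_deg)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final surrogate loss:", loss.item())
```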

Wed 22 Oct. 14:15 - 16:15 PDT

#53
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

Li · Nikolaos Tsagkas · Jifei Song · Ruaridh Mon-Williams · Sethu Vijayakumar · Kun Shao · Laura Sevilla-Lara

Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves 77.1% successful grasping among 179 trials, including evaluations on seen, unseen classes, and cluttered scenes.

Wed 22 Oct. 14:15 - 16:15 PDT

#54
EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

Yufei Cai · Hu Han · Yuxiang Wei · Shiguang Shan · Xilin Chen

The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer approaches explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization frameworks, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#55
Highlight
X-Dancer: Expressive Music to Human Dance Video Generation

Zeyuan Chen · Hongyi Xu · Guoxian Song · You Xie · Chenxu Zhang · Xin Chen · Chao Wang · Di Chang · Linjie Luo

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended and music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and model will be released for research purposes.

Wed 22 Oct. 14:15 - 16:15 PDT

#56
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao · James Jun Liang Chen Ye · Zhengyi Wang · Guangce Liu · Yiwen Chen · Yikai Wang · Jun Zhu

Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality.

Wed 22 Oct. 14:15 - 16:15 PDT

#57
SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures

Yi Qin · Rui Wang · Tao Huang · Tong Xiao · Liping Jing

While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often exhibit limited transferability due to insufficient exploration of common weaknesses across domains. To address this, we propose a novel method, Vertex-Refining Simplicial Complex Attack (VeSCA), which generates transferable adversarial examples by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data. Notably, VeSCA leverages only the encoder of SAM, which mitigates the overfitting issue, and generates consistently transferable adversarial examples by random simplicial complex sampling. Extensive experiments demonstrate that VeSCA improves performance by 12.7\% compared to state-of-the-art methods across three downstream model categories and five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM's vulnerabilities.

Wed 22 Oct. 14:15 - 16:15 PDT

#58
Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes

Feng Huang · Shuyuan Zheng · Zhaobing Qiu · Huanxian Liu · huanxin Bai · Liqiong Chen

Infrared small target detection is currently a hot and challenging task in computer vision. Existing methods usually focus on mining the visual features of targets, which makes it difficult to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interferences, leading to lower detection performance. To address this issue, we introduce a novel approach that leverages semantic text to guide infrared small target detection, called Text-IRSTD. It innovatively expands classical IRSTD to text-guided IRSTD, providing a new research direction. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than the state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released after acceptance of this paper.

Wed 22 Oct. 14:15 - 16:15 PDT

#59
AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation

Haifeng Zhong · Fan Tang · Zhuo Chen · Hyung Jin Chang · Yixing Gao

The challenge of multimodal semantic segmentation lies in establishing semantically consistent and segmentable multimodal fusion features under conditions of significant visual feature discrepancies. Existing methods commonly construct cross-modal self-attention fusion frameworks or introduce additional multimodal fusion loss functions to establish fusion features. However, these approaches often overlook the challenge caused by feature discrepancies between modalities during the fusion process. To achieve precise segmentation, we propose an Attention-Driven Multimodal Discrepancy Alignment Network (AMDANet). AMDANet reallocates weights to reduce the saliency of discrepant features and utilizes low-weight features as cues to mitigate discrepancies between modalities, thereby achieving multimodal feature alignment. Furthermore, to simplify the feature alignment process, a semantic consistency inference mechanism is introduced to reveal the network's inherent bias toward specific modalities, thereby compressing cross-modal feature discrepancies from the foundational level. Extensive experiments on the FMB, MFNet, and PST900 datasets demonstrate that AMDANet achieves mIoU improvements of 3.6%, 3.0%, and 1.6%, respectively, significantly outperforming state-of-the-art methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#60
Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars

Vanessa Sklyarova · Egor Zakharov · Malte Prinzler · Giorgio Becherini · Michael Black · Justus Thies

We present a novel approach for hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hair and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality since it requires manual work from skilled artists to model the 3D hairstyles and create nearly-photorealistic renderings. To address this, we propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image, while preserving the general structure of hairstyles. We exploit this prior to create a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Through qualitative and quantitative comparisons with existing reconstruction pipelines, we demonstrate the effectiveness and superior performance of our method for capturing detailed hair orientation, overall silhouette, and backside consistency.

Wed 22 Oct. 14:15 - 16:15 PDT

#61
AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm

Xinyue Li · Zhangkai Ni · Wenhan Yang

Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but their black-box design restricts interpretability and consistency. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks—alignment and fusion—optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. Our method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporating spatial correspondence priors across LDR images and naturally bridging the alignment and fusion subproblems through joint constraints. Building on the mathematical foundation, we reimagine traditional iterative optimization through unfolding—transforming the conventional solution process into an end-to-end trainable AFUNet with carefully designed modules that work progressively. Specifically, each iteration of AFUNet incorporates an Alignment-Fusion Module (AFM) that alternates between a Spatial Alignment Module (SAM) for alignment and a Channel Fusion Module (CFM) for adaptive feature fusion, progressively bridging misaligned content and exposure discrepancies. Extensive qualitative and quantitative evaluations demonstrate AFUNet’s superior performance, consistently surpassing state-of-the-art methods. Our codes will be made available.

Wed 22 Oct. 14:15 - 16:15 PDT

#62
PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups

Sakuya Ota · Qing Yu · Kent Fujiwara · Satoshi Ikehata · Ikuro Sato

Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into sequential, semantically relevant pairwise interactions, leveraging pretrained two-person interaction diffusion models. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#63
TeRA: Rethinking Text-guided Realistic 3D Avatar Generation

Yanwen Wang · Yiyu Zhuang · Jiawei Zhang · Li Wang · Yifei Zeng · Xun Cao · Xinxin Zuo · Hao Zhu

Efficient 3D avatar creation is a significant demand in the metaverse, film/game, AR/VR, etc. In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human representation. Experiments have proven our approach's superiority over previous text-to-avatar generative models in both subjective and objective evaluations. The code and data will be publicly released upon publication.

Wed 22 Oct. 14:15 - 16:15 PDT

#64
A Unified Framework for Motion Reasoning and Generation in Human Interaction

Jeongeun Park · Sungjoon Choi · Sangdoo Yun

Recent advancements in large language models (LLMs) have significantly improved their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these interactions. Additionally, a unified and versatile model is needed to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies that primarily focus on uni-directional tasks such as text-to-motion or motion-to-text, VIM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the absence of an appropriate dataset to support this task, we introduce Inter-MT$^2$, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, covering 153K interactive motion samples. Inter-MT$^2$ spans diverse instructional scenarios, including motion editing, question answering, and story generation, leveraging off-the-shelf large language models and motion diffusion models to construct a broad set of interactive motion instructions. We extensively evaluate the versatility of VIM across multiple interactive motion-related tasks, including motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. Notably, VIM is the first model capable of effectively addressing all these tasks within a single unified framework, achieving competitive performance compared to task-specific methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#65
Open-World Skill Discovery from Unsegmented Demonstration Videos

Jingwen Deng · Zihao Wang · Shaofei Cai · Anji Liu · Yitao Liang

Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in the Minecraft environment, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of two conditioned policies by 63.7\% and 52.1\% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3\% and 20.8\% on long-horizon tasks.
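
The boundary rule itself is simple to prototype. The sketch below assumes a per-frame prediction-error trace is already available (in the paper it comes from a pretrained unconditional action-prediction model) and flags frames whose error spikes well above recent statistics; the window size, threshold, and minimum gap are illustrative choices.

```python
# Simplified boundary rule in the spirit of SBD: flag a skill boundary wherever
# the per-frame prediction error rises significantly above its trailing-window
# statistics, enforcing a minimum gap between consecutive boundaries.
import numpy as np

def skill_boundaries(pred_errors, window=30, z_thresh=3.0, min_gap=60):
    """Return frame indices whose error z-score against the trailing window
    exceeds z_thresh, keeping boundaries at least min_gap frames apart."""
    errors = np.asarray(pred_errors, dtype=np.float64)
    boundaries, last = [], -min_gap
    for t in range(window, len(errors)):
        hist = errors[t - window:t]
        mu, sigma = hist.mean(), hist.std() + 1e-8
        if (errors[t] - mu) / sigma > z_thresh and t - last >= min_gap:
            boundaries.append(t)
            last = t
    return boundaries

# Toy error trace: low error within skills, spikes at frames 200 and 450.
rng = np.random.default_rng(0)
errors = rng.normal(1.0, 0.1, 600)
errors[200] += 2.0
errors[450] += 2.5
print(skill_boundaries(errors))        # expected: [200, 450]
```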

Wed 22 Oct. 14:15 - 16:15 PDT

#66
CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Elena Bueno-Benito · Mariella Dimiccoli

Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
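
As a reference point for the OT machinery such methods rely on, the sketch below runs a generic entropic (Sinkhorn) solver between toy frame and action embeddings to obtain soft pseudo-labels; it does not reproduce CLOT's multi-level or temporally structured formulation.

```python
# Generic entropic OT (Sinkhorn) between frames and actions, yielding a soft
# assignment whose rows act as per-frame pseudo-labels. This is only the basic
# building block; CLOT solves several coupled, structured OT problems.
import numpy as np

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    r = np.full(n, 1.0 / n)            # uniform marginal over frames
    c = np.full(m, 1.0 / m)            # uniform marginal over actions
    v = np.ones(m)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
frame_feat = rng.normal(size=(100, 16))    # toy frame embeddings
action_feat = rng.normal(size=(5, 16))     # toy action embeddings
cost = 1.0 - frame_feat @ action_feat.T / (
    np.linalg.norm(frame_feat, axis=1, keepdims=True) *
    np.linalg.norm(action_feat, axis=1, keepdims=True).T)   # cosine cost
plan = sinkhorn(cost)
pseudo_labels = plan.argmax(axis=1)        # hard labels from the soft plan
print(pseudo_labels[:10])
```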

Wed 22 Oct. 14:15 - 16:15 PDT

#67

In the field of pan-sharpening, existing deep methods struggle to deepen cross-modal complementarity in intermediate features and lack effective strategies to harness the network as a whole for optimal solutions, exhibiting limited feasibility and interpretability due to their black-box designs. Besides, validating pan-sharpening performance on high-level semantic tasks has been intractable due to the absence of suitable datasets. To tackle these issues, we propose a deep adaptive unfolded network via spatial morphology stripping and spectral filtration for pan-sharpening, which is conceptualized as a linear inverse problem regularized by spatial and spectral priors. Specifically, we incorporate phase-oriented constraints into the spatial prior to facilitate thorough extraction of modal-invariant spatial morphology by intrinsic decomposition, and leverage physics-driven spectral filtration attention mechanisms aligned with the spectral prior to mine the inherent spectral correlation. After transparently unfolding the model into a multi-stage network, an adaptive stage-exiting mechanism is designed to capitalize on fusion diversity by aggregating optimal image patches across candidate stages. To complete the assessment systematically, we construct the first panoptic segmentation dataset serving as a semantic-level benchmark for pan-sharpening performance validation. Extensive experiments are conducted to verify the merits of our method against state-of-the-art methods.
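
For readers unfamiliar with the unfolding setup, a standard variational formulation that pan-sharpening unfolding methods typically start from (the paper's exact observation operators and prior terms may differ) is

$$
\min_{\mathbf{X}}\ \tfrac{1}{2}\,\|\mathbf{D}\mathbf{B}\mathbf{X}-\mathbf{Y}\|_F^2 \;+\; \tfrac{\lambda_1}{2}\,\|\mathcal{R}(\mathbf{X})-\mathbf{P}\|_F^2 \;+\; \lambda_2\,\Phi_{\mathrm{spa}}(\mathbf{X}) \;+\; \lambda_3\,\Phi_{\mathrm{spe}}(\mathbf{X}),
$$

where $\mathbf{X}$ is the target high-resolution multispectral image, $\mathbf{Y}$ the observed low-resolution multispectral image, $\mathbf{P}$ the panchromatic image, $\mathbf{B}$ a blur operator, $\mathbf{D}$ spatial downsampling, $\mathcal{R}$ a spectral response mapping $\mathbf{X}$ into the panchromatic domain, and $\Phi_{\mathrm{spa}}$, $\Phi_{\mathrm{spe}}$ the spatial and spectral priors, which this paper instantiates with its phase-oriented morphology and spectral-filtration modules. Unfolding the iterations of a proximal solver for such an objective, one module per iteration, yields the kind of multi-stage network described above.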

Wed 22 Oct. 14:15 - 16:15 PDT

#68
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Sanjoy Chowdhury · Subrata Biswas · Sayan Nag · Tushar Nagarajan · Calvin Murdock · Ishwarya Ananthabhotla · Yijun Qian · Vamsi Ithapu · Dinesh Manocha · Ruohan Gao

Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EGOADAPT, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6×, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.

Wed 22 Oct. 14:15 - 16:15 PDT

#69
Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes

CHEN LIANG · Wenguan Wang · Yi Yang

Building autonomous agents that can replicate human behavior in the realistic 3D world is a key step toward artificial general intelligence. This requires agents to be holistic goal achievers and to naturally adapt to environmental dynamics. In this work, we introduce ACTOR, an agent capable of performing high-level, long-horizon, abstract goals in 3D households, guided by internal values similar to those of humans. ACTOR operates in a perceive-plan-act cycle, extending the ungrounded, scene-agnostic LLM controller with deliberate goal decomposition and decision-making: it actively searches the behavior space, generates activity choices based on a hierarchical prior, and evaluates these choices using customizable value functions to determine the subsequent steps. Furthermore, we introduce BehaviorHub, a large-scale human behavior simulation dataset of scene-aware, complicated tasks. Since acquiring human-authored 3D human behavior data is prohibitively expensive, we construct BehaviorHub by exploiting the commonsense knowledge that LLMs learn from large corpora and automatically aligning motion resources with 3D scenes for knowledgeable generation. Extensive experiments on our established benchmark demonstrate that the proposed architecture leads to effective behavior planning and simulation. BehaviorHub also proves beneficial for downstream task development. Our code and dataset will be publicly released.

Wed 22 Oct. 14:15 - 16:15 PDT

#70
Reference-based Super-Resolution via Image-based Retrieval-Augmented Generation Diffusion

Byeonghun Lee · Hyunmin Cho · Honggyu Choi · Soo Min Kang · ILJUN AHN · Kyong Hwan Jin

Most existing diffusion models have primarily utilized reference images for image-to-image translation rather than for super-resolution (SR). In SR-specific tasks, diffusion methods depend only on low-resolution (LR) inputs, limiting their ability to leverage reference information. Prior reference-based diffusion SR methods have demonstrated that incorporating appropriate reference images can significantly enhance reconstruction quality; however, identifying suitable references in real-world scenarios remains a critical challenge. Recently, Retrieval-Augmented Generation (RAG) has emerged as an effective framework that integrates retrieval-based and generation-based information from databases to enhance the accuracy and relevance of responses to a given query. Inspired by RAG, we propose an image-based RAG framework (iRAG) for realistic super-resolution. iRAG employs a trainable hashing function to effectively retrieve either real-world or generated reference images given a query LR input. The retrieved patches are then passed to a restoration module, where they are leveraged to augment the retrieved information and generate high-fidelity super-resolved features. Furthermore, to improve the quality of generated references from pre-trained diffusion models, we adopt a hallucination filtering mechanism, leading to overall performance enhancements. Experimental results demonstrate that our approach not only resolves the practical difficulties of reference selection but also delivers superior performance compared to existing diffusion-based and non-diffusion RefSR methods.
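
The retrieval step can be pictured with a generic hashing sketch: features are mapped to binary codes and references are ranked by Hamming distance. The projection below is random rather than trained, so it only illustrates the lookup mechanics, not iRAG's learned hashing function.

```python
# Generic hash-based reference retrieval: map query and database features to
# binary codes, then return the nearest references by Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
dim, bits = 128, 64
proj = rng.normal(size=(dim, bits))          # stand-in for a learned hash projection

def hash_codes(feats):
    """Sign-based binary codes from a linear projection."""
    return (feats @ proj > 0).astype(np.uint8)

def retrieve(query_feat, db_codes, k=5):
    """Indices of the k database entries closest to the query in Hamming distance."""
    q = hash_codes(query_feat[None])[0]
    hamming = np.count_nonzero(db_codes != q, axis=1)
    return np.argsort(hamming)[:k]

db_feats = rng.normal(size=(10000, dim))     # features of candidate reference patches
db_codes = hash_codes(db_feats)
query = db_feats[42] + 0.05 * rng.normal(size=dim)
print(retrieve(query, db_codes))             # entry 42 should rank near the top
```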

Wed 22 Oct. 14:15 - 16:15 PDT

#71
Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation

Xiaolong Xu · Lei Zhang · Jiayi Li · Lituan Wang · Yifan Guan · Yu Yan · Leyi Zhang · Hao Song

Video semantic segmentation aims to assign a class label to each pixel in every video frame. Existing methods predominantly follow the reference-target interaction paradigm, focusing on extracting local temporal contexts while neglecting the integration of global temporal information. Moreover, complex dynamics and varying lighting conditions introduce inter-frame intra-class discrepancies in feature representations, leading to unstable predictions. In this paper, we propose a novel framework, the Dual-Temporal Exemplar Representation Network (DTERN), which utilizes the strong representational capability of cluster centers, i.e., exemplars, to effectively model both local and global temporal information. DTERN consists of two core modules: 1) the Local Temporal Exemplar Module (LTEM), which constructs local exemplars to capture local temporal contexts, ensuring stable and reliable predictions; and 2) the Global Temporal Exemplar Module (GTEM), which introduces learnable global exemplars to dynamically model global temporal information, thereby improving the effective consistency of segmentation. Furthermore, we observe that the existing Video Consistency (VC) metric fails to evaluate segmentation accuracy and lacks sensitivity to small-object segmentation. To this end, we propose Video Effective Consistency (VEC) to comprehensively evaluate temporal consistency and segmentation effectiveness. Experiments on VSPW and Cityscapes demonstrate that DTERN outperforms state-of-the-art methods. The code is available at https://anonymous.4open.science/r/DTERN/.

Wed 22 Oct. 14:15 - 16:15 PDT

#72
Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

Dat NGUYEN · Marcella Astrid · Anis Kacem · Enjie Ghorbel · Djamila Aouada

Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained on real and fake image sequences, which hinders their generalization to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hands-free annotations for our auxiliary branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#73
Multi-modal Identity Extraction

Ryan Webster · Teddy Furon

The success of multi-modal foundation models can be partly attributed to their diverse, billion-scale training data. By nature, web data contains human faces and descriptions of individuals. Thus, these models pose potentially widespread privacy issues. Recently, identity membership inference attacks (IMIAs) against the CLIP model showed that membership of an individual's name and image within training data can be reliably inferred. This work formalizes the problem of identity extraction, wherein an attacker can reliably extract the names of individuals given their images only. We provide the following contributions: (i) we adapt a previous IMIA to the problem of selecting the correct name among a large set and show that the method scales to millions of names; (ii) we design an attack that outperforms the adapted baseline; (iii) we show that an attacker can extract names via optimization only. To demonstrate the utility of our framework, we show how identity extraction can be used to audit model privacy. Indeed, a family of prominent models that advertise blurring faces before training to protect privacy is still highly vulnerable to attack.

Wed 22 Oct. 14:15 - 16:15 PDT

#74
NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

Tianyi Wang · Shuaicheng Niu · Harry Cheng · xiao zhang · Yinglong Wang

As passive detection of high-quality Deepfake images suffers from performance bottlenecks due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue for the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, and a Perturbation Block is devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.

Wed 22 Oct. 14:15 - 16:15 PDT

#75
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Haiwen Huang · Anpei Chen · Volodymyr Havrylov · Andreas Geiger · Dan Zhang

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks.
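
As a rough picture of the upsampler architecture described above, the sketch below builds queries from high-resolution pixel coordinates and RGB values and cross-attends into the low-resolution VFM feature map. It is a minimal stand-in, not the published LoftUp architecture; the dimensions, the query encoding, and the single attention layer are all assumptions.

import torch
import torch.nn as nn

class CoordCrossAttnUpsampler(nn.Module):
    # Sketch of a coordinate-based cross-attention upsampler (illustrative only).
    def __init__(self, feat_dim=384, query_in=5, heads=4):
        super().__init__()
        self.q_proj = nn.Linear(query_in, feat_dim)        # (x, y, r, g, b) -> feat_dim
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, image, lr_feats):
        # image: (B, 3, H, W) high-res input; lr_feats: (B, C, h, w) low-res VFM features
        B, _, H, W = image.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)          # normalized pixel coords
        rgb = image.permute(0, 2, 3, 1)                                    # (B, H, W, 3)
        q = self.q_proj(torch.cat([coords, rgb], dim=-1)).flatten(1, 2)    # (B, H*W, feat_dim)
        kv = lr_feats.flatten(2).transpose(1, 2)                           # (B, h*w, feat_dim)
        out, _ = self.attn(q, kv, kv)                                      # each pixel attends to LR tokens
        return out.transpose(1, 2).reshape(B, -1, H, W)                    # (B, feat_dim, H, W)

up = CoordCrossAttnUpsampler()
hi_res = up(torch.rand(1, 3, 64, 64), torch.rand(1, 384, 16, 16))
print(hi_res.shape)  # torch.Size([1, 384, 64, 64])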

Wed 22 Oct. 14:15 - 16:15 PDT

#76
Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali · Syed Ahmed Mahmood · Mubin Saeed · Andrey Konin · Zeeshan Zia · Quoc-Huy Tran

We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment results and superior action segmentation results relative to previous methods on the respective tasks. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
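
For readers unfamiliar with optimal-transport alignment, the sketch below runs a plain entropic (Sinkhorn) solver between two frame-embedding sequences. It only conveys the alignment idea; the paper's fused Gromov-Wasserstein formulation with a structural prior, and the joint segmentation objective, are not reproduced, and all sizes are illustrative.

import torch

def sinkhorn(cost, eps=0.05, iters=100):
    # cost: (n, m) pairwise cost between the frames of two videos
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)             # uniform marginal over video A frames
    nu = torch.full((m,), 1.0 / m)             # uniform marginal over video B frames
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones(n)
    for _ in range(iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]         # transport plan (n, m)

feats_a = torch.nn.functional.normalize(torch.randn(40, 128), dim=1)
feats_b = torch.nn.functional.normalize(torch.randn(50, 128), dim=1)
cost = 1.0 - feats_a @ feats_b.t()             # cosine distance between frame embeddings
plan = sinkhorn(cost)
print(plan.shape, plan.sum().item())           # (40, 50), mass sums to ~1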

Wed 22 Oct. 14:15 - 16:15 PDT

#77
VSSD: Vision Mamba with Non-Causal State Space Duality

Yuheng Shi · Mingjia Li · Minjing Dong · Chang Xu

Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability in processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their applications in non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which removes the dependence of each token's contribution on preceding tokens. Combined with multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models.

Wed 22 Oct. 14:15 - 16:15 PDT

#78
EgoM2P: Egocentric Multimodal Multitask Pretraining

Gen Li · Yutong Chen · Yiqian Wu · KAIFENG ZHAO · Marc Pollefeys · Siyu Tang

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoMLVM, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose video model for egocentric understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoMLVM matches or outperforms specialist models while being an order of magnitude faster. To support the community and advance egocentric vision research, we will fully open-source EgoMLVM, along with the training and evaluation code.

Wed 22 Oct. 14:15 - 16:15 PDT

#79
Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions

Yuanhong Zheng · Ruixuan Yu · Jian Sun

3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate a spatial inter-person distance embedding. With the above efficient temporal and spatial designs, we achieve state-of-the-art performance on multiple metrics on the standard CMU-Mocap, MuPoTS-3D, and 3DPW datasets, while significantly reducing the computational cost.

Wed 22 Oct. 14:15 - 16:15 PDT

#80
E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes

Yan Liu · Zehao Chen · Haojie Yan · De Ma · Huajin Tang · Qian Zheng · Gang Pan

Synthesizing novel space-time views from a monocular video is a highly ill-posed problem, and its effectiveness relies on accurately reconstructing the motion and appearance of the dynamic scene. Frame-based methods for novel space-time view synthesis in dynamic scenes rely on simplistic motion assumptions due to the absence of inter-frame cues, which makes them fail under complex motion. Event cameras capture inter-frame cues with high temporal resolution, giving them promising potential to handle high-speed and complex motion. However, exploiting these cues remains difficult due to event noise and sparsity. To this end, we propose E-NeMF, which alleviates the impact of event noise with a Parametric Motion Representation and mitigates event sparsity with a Flow Prediction Module. Experiments on multiple real-world datasets demonstrate our superior performance in handling high-speed and complex motion.

Wed 22 Oct. 14:15 - 16:15 PDT

#81
LayerAnimate: Layer-level Control for Animation

Yuxue Yang · Lue Fan · Zuzeng Lin · Feng Wang · Zhaoxiang Zhang

Traditional animation production decomposes visual elements into discrete layers to enable independent processing for sketching, refining, coloring, and in-betweening. Existing anime video generation methods typically treat animation as a data domain distinct from real-world videos, lacking fine-grained control at the layer level. To bridge this gap, we introduce LayerAnimate, a novel video diffusion framework with a layer-aware architecture that empowers the manipulation of layers through layer-level controls. The development of a layer-aware framework faces a significant data scarcity challenge due to the commercial sensitivity of professional animation assets. To address this limitation, we propose a data curation pipeline featuring Automated Element Segmentation and Motion-based Hierarchical Merging. Through quantitative and qualitative comparisons and a user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an effective tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-level animation applications and creative flexibility. The code will be available upon publication.

Wed 22 Oct. 14:15 - 16:15 PDT

#82
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Junhao Cheng · Yuying Ge · Yixiao Ge · Jing Liao · Ying Shan

Recent advancements in image and video synthesis have opened up new possibilities for generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as "infinite games" since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience.

Wed 22 Oct. 14:15 - 16:15 PDT

#83
HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Jiaxin Lu · Chun-Hao Huang · Uttaran Bhattacharya · Qixing Huang · Yi Zhou

We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline that creates complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup that effectively handles occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmark comparisons with other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains, with practical applications in animation, robotics, and embodied AI systems. Project: https://anonymous.4open.science/w/humoto-1782/ .

Wed 22 Oct. 14:15 - 16:15 PDT

#84
Highlight
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Liming Jiang · Qing Yan · Yumin Jia · Zichuan Liu · Hao Kang · Xin Lu

Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.

Wed 22 Oct. 14:15 - 16:15 PDT

#85
Enhanced Pansharpening via Quaternion Spatial-Spectral Interactions

Dong Li · Chunhui Luo · Yuanfei Bao · Gang Yang · Jie Xiao · Xueyang Fu · Zheng-Jun Zha

Pansharpening aims to generate high-resolution multispectral (MS) images by fusing panchromatic (PAN) images with corresponding low-resolution MS images. However, many existing methods struggle to fully capture spatial and spectral interactions, limiting their effectiveness. To address this, we propose a novel quaternion-based spatial-spectral interaction network that enhances pansharpening by leveraging the compact representation capabilities of quaternions for high-dimensional data. Our method consists of three key components: the quaternion global spectral interaction branch, the quaternion local spatial structure awareness branch, and the quaternion spatial-spectral interaction branch. The first applies the quaternion Fourier transform to convert multi-channel features into the frequency domain as a whole, enabling global information interaction while preserving inter-channel dependencies, which aids spectral fidelity. The second uses a customized spatial quaternion representation, combined with a window-shifting strategy, to maintain local spatial dependencies while promoting spatial interactions, which helps inject spatial details. The last integrates the two pathways within the quaternion framework to enrich spatial-spectral interactions for richer representations. By utilizing the quaternion's multi-dimensional representation and parameter-sharing properties, our method achieves more compact and efficient cross-resolution, multi-band information integration, significantly improving the quality of the fused image. Extensive experiments validate the proposed method's effectiveness and its superior performance over current SOTA techniques. Code will be publicly available.
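
The basic operation behind quaternion-valued feature interaction is the Hamilton product, sketched below for 4-channel feature groups. This is a generic illustration of quaternion arithmetic, not the paper's spectral/spatial branches or its quaternion Fourier transform; treating a 4-band multispectral feature as one quaternion per pixel is an assumption made for the example.

import torch

def hamilton_product(q, p):
    # q, p: (..., 4) quaternions stored as (r, i, j, k)
    r1, i1, j1, k1 = q.unbind(-1)
    r2, i2, j2, k2 = p.unbind(-1)
    return torch.stack([
        r1*r2 - i1*i2 - j1*j2 - k1*k2,   # real part
        r1*i2 + i1*r2 + j1*k2 - k1*j2,   # i component
        r1*j2 - i1*k2 + j1*r2 + k1*i2,   # j component
        r1*k2 + i1*j2 - j1*i2 + k1*r2,   # k component
    ], dim=-1)

# Treat a 4-band MS feature map as one quaternion per pixel: (B, H, W, 4)
ms = torch.randn(2, 64, 64, 4)
pan = torch.randn(2, 64, 64, 4)
fused = hamilton_product(ms, pan)
print(fused.shape)  # torch.Size([2, 64, 64, 4])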

Wed 22 Oct. 14:15 - 16:15 PDT

#86
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

Peng Chen · Pi Bu · Yingyao Wang · Xinyi Wang · Ziming Wang · Jie Guo · Yingxiu Zhao · Qi Zhu · Jun Song · Siran Yang · Jiamang Wang · Bo Zheng

Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, model weights, training code, and framework implementation.

Wed 22 Oct. 14:15 - 16:15 PDT

#87
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

Pinxin Liu · Luchuan Song · Junhua Huang · Haiyang Liu · Chenliang Xu

Generating full-body human gestures from speech signals remains challenging in terms of quality and speed. Existing approaches model different body regions such as the body, legs, and hands separately, which fails to capture the spatial interactions between them and results in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines suffer from slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow-matching baseline, we propose latent shortcut learning and beta-distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and the improved flow-matching framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications.
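
As background for the flow-matching component, the sketch below runs a plain Euler sampler along a learned velocity field over a latent gesture sequence. The velocity network is a placeholder, and the paper's latent shortcut learning and beta-distribution time stamp sampling are not reproduced; shapes and step counts are assumptions.

import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    # Placeholder velocity field v(x, t); the real model is conditioned on speech.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
    def forward(self, x, t):
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def sample(model, n_frames=32, dim=64, steps=8):
    x = torch.randn(n_frames, dim)                 # x_0 ~ noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        dt = ts[i + 1] - ts[i]
        x = x + dt * model(x, ts[i].view(1, 1))    # Euler step along the learned flow
    return x                                       # x_1: latent gesture sequence

latents = sample(VelocityNet())
print(latents.shape)  # torch.Size([32, 64])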

Wed 22 Oct. 14:15 - 16:15 PDT

#88
MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Luca Collorone · Matteo Gioia · Massimiliano Pappa · Paolo Leoni · Giovanni Ficarra · Or Litany · Indro Spinelli · Fabio Galasso

Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR significantly outperforms models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#89
Highlight
UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

Kyusu Ahn · JiSoo Kim · Sangik Lee · HyunGyu Lee · Byeonghyun Ko · Chanwoo Park · Jaejin Lee

Under-Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare issues. Tackling such issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos remain largely unexplored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, UDC-VIT exclusively includes human motions intended for facial recognition. We propose a video-capturing system to simultaneously acquire non-degraded and UDC-degraded videos of the same scene. Then, we align each pair of captured videos frame by frame using the discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still-image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy with respect to PSNR, SSIM, and LPIPS scores. UDC-VIT enables further exploration of UDC video restoration and offers better insights into the challenge. UDC-VIT is available at our project site.

Wed 22 Oct. 14:15 - 16:15 PDT

#90
Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

Yuxuan Wang · Xuanyu Yi · Haohan Weng · Qingshan Xu · xiaokang wei · Xianghui Yang · Chunchao Guo · Long Chen · Hanwang Zhang

Triangle meshes are fundamental to 3D applications. Current automatic mesh generation methods typically rely on intermediate representations that lack the continuous surface quality inherent to meshes. Converting these representations into meshes produces dense, suboptimal outputs. Although recent autoregressive approaches demonstrate promise in directly modeling mesh vertices and faces, they are constrained by limitations in face count, scalability, and structural fidelity. To address these challenges, we propose Nautilus, a locality-aware autoencoder for artist-like mesh generation that leverages the local properties of manifold meshes to achieve structural fidelity and efficient representation. Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. Furthermore, we develop a Dual-stream Point Conditioner that captures fine-grained geometric features, ensuring global consistency and local structural fidelity. Extensive experiments demonstrate that Nautilus significantly outperforms state-of-the-art methods in generation quality.

Wed 22 Oct. 14:15 - 16:15 PDT

#91
Blind Video Super-Resolution based on Implicit Kernels

Qiang Zhu · Yuxuan Jiang · Shuyuan Zhu · Fan Zhang · David Bull · Bing Zeng

Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at https://github.com.

Wed 22 Oct. 14:15 - 16:15 PDT

#92
Highlight
F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Lu Liu · Huiyu Duan · Qiang Hu · Liu Yang · Chunlei Cai · Tianxiao Ye · Huayu Liu · Xiaoyun Zhang · Guangtao Zhai

Recent artificial intelligence (AI) generative models have demonstrated remarkable capabilities in image production, and have been widely applied to face image generation, customization, and restoration. However, many AI-generated faces (AIGFs) still suffer from issues such as unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation method for AIGFs. To this end, we introduce FaceQ, the first comprehensive AI-generated Face image database with fine-grained Quality annotations aligned with human preferences, which consists of 12K images and 491K ratings across multiple dimensions. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA) methods on FaceQ, and further propose a large multimodal model (LMM) based Face quality Evaluator (F-Eval) to accurately assess the multi-dimensional quality of generated faces in a one-for-all manner. Extensive experimental results demonstrate the state-of-the-art performance of our F-Eval. FaceQ, F-Bench, and F-Eval will be publicly available upon publication.

Wed 22 Oct. 14:15 - 16:15 PDT

#93
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer

Qingyu Shi · Jianzong Wu · Jinbin Bai · Lu Qi · Jiangning Zhang · Yunhai Tong · Xiangtai Li

The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within a 3D U-Net. In contrast, state-of-the-art Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local similarity of motion. Our work therefore provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity. The source code and trained models will be made available to the public.
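
To make the temporal-kernel idea concrete, the sketch below smooths video token features along the time axis with a depthwise box kernel. The kernel shape, its size, and where it would sit inside a DiT block are assumptions; this only illustrates the kind of temporal smoothing the method builds on.

import torch
import torch.nn.functional as F

def temporal_smooth(feats, k=3):
    # feats: (B, T, N, C) video tokens; smooth over T with a depthwise 1D box kernel
    B, T, N, C = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(B * N, C, T)        # (B*N, C, T)
    kernel = torch.ones(C, 1, k) / k                          # averaging kernel per channel
    x = F.conv1d(x, kernel, padding=k // 2, groups=C)         # depthwise temporal convolution
    return x.reshape(B, N, C, T).permute(0, 3, 1, 2)          # back to (B, T, N, C)

tokens = torch.randn(1, 16, 256, 128)                         # 16 frames of 256 tokens, 128 channels
print(temporal_smooth(tokens).shape)                          # torch.Size([1, 16, 256, 128])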

Wed 22 Oct. 14:15 - 16:15 PDT

#94
Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Yusheng Dai · Chenxi Wang · Chang Li · Chen Wang · Kewei Li · Jun Du · Lei Sun · Jianqing Gao · Ruoyu Wang · Jiefeng Ma

This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer-length adaptation. It also adapts well to panorama generation, achieving comparable performance with 2~20× faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/.

Wed 22 Oct. 14:15 - 16:15 PDT

#95
Blind Noisy Image Deblurring Using Residual Guidance Strategy

Heyan Liu · Jianing Sun · Jun Liu · Xi-Le Zhao · Tingting WU · Tieyong Zeng

Blind deblurring is an ill-posed inverse problem that involves recovering both the clear image and the blur kernel from a single blurry image. In real photography, longer exposure times introduce substantial noise into the blurry image. Although existing blind deblurring methods produce satisfactory results on blurred images with little or no noise, they struggle to handle high noise levels. Strong noise compromises the accuracy of the estimated kernel and significantly reduces the quality of the deblurring results. To address this challenge, we propose a Residual Guidance Strategy (RGS) to suppress the influence of noise. Our method leverages adjacent coarser-scale information in the image pyramid to guide the blur kernel estimation at the current scale. Therefore, for blurred images with unknown noise levels and types, our method still estimates more accurate blur kernels, which are essential for subsequent non-blind restoration. Extensive experiments on both synthetic and real datasets have demonstrated that our method consistently outperforms numerous state-of-the-art methods both quantitatively and qualitatively under high noise levels.

Wed 22 Oct. 14:15 - 16:15 PDT

#96
Drawing Developmental Trajectory from Cortical Surface Reconstruction

WENXUAN WU · ruowen qu · Zhongliang Liu · Zhuoyan Dai · Dongzi Shi · Sijin Yu · Tong Xiong · Shiping Liu · Xiangmin Xu · Xiaofen Xing · Xin Zhang

Diffeomorphic-based cortical surface reconstruction typically involves a series of deformation processes to extract the cerebral cortex from brain magnetic resonance images (MRI). While most methods are designed for adult brains using Neural Ordinary Differential Equations (NODE) with fixed step sizes, the neonatal brain, which exhibits dramatic changes in cortical folding patterns early in life, requires a more adaptive approach. To address this, we develop a dual-task framework to directly characterize the brain development trajectory through the process of cortical surface reconstruction. For the white matter (inner) surfaces, we employ an Age-Conditioned ODE with adaptive step sizes. It is initially trained on a limited set of longitudinal paired data to establish a coarse trajectory, which is then refined through sample training of single-point data and knowledge distillation. For the pial (outer) surfaces, we position the midthickness surfaces as intermediates and employ a cycle-consistent semi-supervised training strategy to depict a coherent brain development trajectory between the inner and outer surfaces. Our approach is the first to achieve precise developmental prediction directly on triangular meshes. Furthermore, by enhancing interpretability at each stage of the deformation process, this approach improves the applicability of diffeomorphic-based methods. The proposed method has demonstrated state-of-the-art performance in modeling developmental trajectories and cortical surface reconstruction on the developing Human Connectome Project (dHCP) dataset.

Wed 22 Oct. 14:15 - 16:15 PDT

#97
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo · Zhengkun Rong · Lizhen Wang · Longhao Zhang · Tianshu Hu

While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which lowers their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, HERA, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency.

Wed 22 Oct. 14:15 - 16:15 PDT

#98
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Yudong Jin · Sida Peng · Xuan Wang · Tao Xie · Zhen Xu · Yifan Yang · Yujun Shen · Hujun Bao · Xiaowei Zhou

This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through this iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping the memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and outperforms the existing methods by a large margin. Our code and dataset will be released.
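
The sliding iterative denoising loop can be sketched schematically as below: a (views x timestamps) latent grid is denoised in overlapping windows, alternating between the spatial and temporal axes. The denoiser here is a dummy stand-in, and the window size, stride, and schedule are assumptions made purely for illustration.

import torch

def dummy_denoise(window, t):
    # Placeholder for one denoising step of the 4D diffusion model.
    return window - 0.1 * t * window

def sliding_denoise(latents, steps=4, win=4):
    # latents: (V, T, C) one latent per (viewpoint, timestamp)
    V, T, C = latents.shape
    for step in range(steps):
        t = 1.0 - step / steps
        axis = step % 2                              # alternate: 0 = spatial (views), 1 = temporal
        length = V if axis == 0 else T
        for s in range(0, length, win // 2):         # overlapping sliding windows
            e = min(s + win, length)
            if axis == 0:
                latents[s:e] = dummy_denoise(latents[s:e], t)
            else:
                latents[:, s:e] = dummy_denoise(latents[:, s:e], t)
    return latents

grid = torch.randn(8, 24, 16)                        # 8 views x 24 timestamps
print(sliding_denoise(grid).shape)                   # torch.Size([8, 24, 16])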

Wed 22 Oct. 14:15 - 16:15 PDT

#99
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Jiangran Lyu · Ziming Li · Xuesong Shi · Chaoyi Xu · Yizhou Wang · He Wang

Nonprehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they heavily rely on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% using only single-view point cloud observations in simulation. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.

Wed 22 Oct. 14:15 - 16:15 PDT

#100
Less is More: Improving Motion Diffusion Models with Sparse Keyframes

Jinseok Bae · Inwoo Hwang · Young-Yoon Lee · Ziyu Guo · Joseph Liu · Yizhak Ben-Shabat · Young Min Kim · Mubbasir Kapadia

Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
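
The keyframe masking and interpolation that the framework is built around can be illustrated with the sketch below: only sparse keyframes are kept and the remaining frames are filled by linear interpolation. Uniform keyframe selection and the pose dimensionality are assumptions; the paper selects geometrically meaningful keyframes and refines the mask during diffusion.

import torch

def keyframe_mask(T, every=8):
    mask = torch.zeros(T, dtype=torch.bool)
    mask[::every] = True
    mask[-1] = True                                   # always keep the last frame
    return mask

def interpolate_from_keyframes(motion, mask):
    # motion: (T, D) pose features; mask: (T,) True at keyframes
    key_idx = torch.nonzero(mask).squeeze(1).float()  # keyframe positions
    key_val = motion[mask]                            # (K, D) keyframe poses
    t = torch.arange(motion.shape[0]).float()
    right = torch.searchsorted(key_idx, t, right=False).clamp(1, len(key_idx) - 1)
    left = right - 1
    w = (t - key_idx[left]) / (key_idx[right] - key_idx[left]).clamp(min=1e-6)
    return (1 - w)[:, None] * key_val[left] + w[:, None] * key_val[right]

motion = torch.randn(60, 66)                          # 60 frames of pose parameters
mask = keyframe_mask(60)
recon = interpolate_from_keyframes(motion, mask)
print(mask.sum().item(), recon.shape)                 # 9 keyframes, torch.Size([60, 66])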

Wed 22 Oct. 14:15 - 16:15 PDT

#101
DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads

Xiaoxi Liang · Yanbo Fan · Qiya Yang · Xuan Wang · Wei Gao · Ge Li

In this work, we investigate the generation of high-fidelity, audio-driven 3D Gaussian talking heads from monocular videos. We present DGTalker, an innovative framework designed for real-time, high-fidelity, and 3D-aware talking head synthesis. By leveraging Gaussian generative priors and treating the task as a latent space navigation problem, our method effectively alleviates the lack of 3D information and the low-quality detail reconstruction caused by overfitting to training views in monocular videos, which has been a longstanding challenge in existing 3DGS-based approaches. To ensure precise lip synchronization and nuanced expression control, we propose a disentangled latent space navigation framework that independently models lip motion and upper-face expressions. Additionally, we introduce an effective masked cross-view supervision strategy to enable robust learning within the disentangled latent space. We conduct extensive experiments and demonstrate that DGTalker surpasses current state-of-the-art methods in visual quality, motion accuracy, and controllability.

Wed 22 Oct. 14:15 - 16:15 PDT

#102
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Yating Wang · Haoyi Zhu · Mingyu Liu · Jiange Yang · Hao-Shu Fang · Tong He

In this paper, we introduce an innovative vector-quantization-based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly; most notably, it achieves up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains.
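
A vector-quantized action tokenizer boils down to a nearest-codebook lookup, sketched below. The codebook size, the chunk shape, and the omission of the encoder/decoder and straight-through gradient are simplifications for brevity; this only shows how continuous action chunks become discrete tokens.

import torch
import torch.nn as nn

class ActionVQ(nn.Module):
    # Minimal vector-quantization lookup (illustrative sizes, not the paper's configuration).
    def __init__(self, codebook_size=512, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (B, N, dim) encoded action chunks
        d = torch.cdist(z, self.codebook.weight[None].expand(z.shape[0], -1, -1))
        tokens = d.argmin(dim=-1)                     # (B, N) discrete action tokens
        quantized = self.codebook(tokens)             # (B, N, dim) quantized chunks
        return tokens, quantized

vq = ActionVQ()
z = torch.randn(2, 10, 32)                            # e.g. 10 encoded action chunks per sample
tokens, zq = vq(z)
print(tokens.shape, zq.shape)                         # torch.Size([2, 10]) torch.Size([2, 10, 32])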

Wed 22 Oct. 14:15 - 16:15 PDT

#103
Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification

Zhiqi Pang · Chunyu Wang · Lingling Zhao · Junjie Wang

Color variations, a key challenge in the unsupervised visible-infrared person re-identification (UVI-ReID) task, have garnered significant attention. While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model’s robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, in this paper, we focus on improving the robustness of pseudo-labels to color variations through data augmentation and propose an augmented and softened matching (ASM) method. Specifically, we first develop the cross-modality augmented matching (CAM) module, which performs channel augmentation on visible images to generate augmented images. Then, based on the fusion of the visible-infrared and augmented-infrared centroid similarity matrices, CAM establishes cross-modality correspondences that are robust to color variations. To increase training stability, we design a soft-labels momentum update (SMU) strategy, which converts traditional one-hot labels into soft-labels through momentum updates, thus adapting to CAM. During the optimization phase, we introduce the cross-modality soft contrastive loss and cross-modality hard contrastive loss to promote modality-invariant learning from the perspectives of shared and diversified features, respectively. Extensive experimental results validate the effectiveness of the proposed method, showing that ASM not only outperforms state-of-the-art unsupervised methods but also competes with some supervised methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#104
Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking

Qiangqiang Wu · Yi Yu · Chenqi Kong · Ziquan Liu · Jia Wan · Haoliang Li · Alex Kot · Antoni Chan

With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable for use on large-scale video datasets. Trackers trained with TUEs heavily rely on unlearnable noise for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#105
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition

Wenhan Wu · Zhishuai Guo · Chen Chen · Hongfei Xue · Aidong Lu

Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, thereby improving zero-shot action recognition.

Wed 22 Oct. 14:15 - 16:15 PDT

#106
Learning Hierarchical Line Buffer for Image Processing

Jiacheng Li · Feiran Li · Daisuke Iso

In recent years, neural networks have achieved significant progress in offline image processing. However, in online scenarios, particularly in on-chip implementations, memory usage emerges as a critical bottleneck due to the limited memory resources of integrated image processors. In this study, we focus on reducing the memory footprint of neural networks for on-chip image processing by optimizing network design for efficient memory utilization. Specifically, we consider a typical scenario in which images output from an image sensor are processed sequentially using line buffers in a line-by-line manner. This setting necessitates the modeling of both intra-line and inter-line correlations—capturing dependencies among pixels within a single line group and across different line groups, respectively. To model intra-line correlations, we propose a progressive feature enhancement strategy, where line pixels are processed with expanding strip convolutions in multiple stages. For inter-line correlation modeling, we introduce a hierarchical line buffer formulation, where features extracted from previous lines are incrementally reused and compressed across multiple hierarchical levels. Comprehensive experiments on various image processing tasks, including RAW denoising, Gaussian denoising, and super-resolution, demonstrate that the proposed method achieves a better trade-off between performance and memory efficiency than previous solutions, e.g., up to 1dB PSNR gain in RAW denoising at one-fifth of peak memory usage.
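
The line-buffer setting can be pictured with the sketch below: each image row is processed by a small network that also sees buffered features from the previous rows, and the buffer is rolled forward one line at a time. The buffer depth, channel counts, and per-line layers are illustrative stand-ins, not the paper's hierarchical design.

import torch
import torch.nn as nn

class LineBufferNet(nn.Module):
    # Schematic line-by-line processor with a rolling buffer of previous-line features.
    def __init__(self, ch=16, buffer_lines=2):
        super().__init__()
        self.ch, self.buffer_lines = ch, buffer_lines
        in_ch = 1 + buffer_lines * ch                         # current raw line + buffered features
        self.extract = nn.Conv1d(in_ch, ch, kernel_size=3, padding=1)
        self.to_pixel = nn.Conv1d(ch, 1, kernel_size=3, padding=1)

    def forward(self, image):
        # image: (1, H, W), processed one row at a time as an on-chip pipeline would
        _, H, W = image.shape
        buffer = torch.zeros(self.buffer_lines, self.ch, W)   # features of previous lines
        out = []
        for y in range(H):
            line = image[:, y].unsqueeze(0)                   # (1, 1, W) current row
            ctx = buffer.reshape(1, -1, W)                    # (1, buffer_lines*ch, W)
            feat = torch.relu(self.extract(torch.cat([line, ctx], dim=1)))  # (1, ch, W)
            out.append(self.to_pixel(feat))                   # (1, 1, W) processed row
            buffer = torch.cat([buffer[1:], feat], dim=0)     # roll the line buffer
        return torch.cat(out, dim=1)                          # (1, H, W)

net = LineBufferNet()
print(net(torch.rand(1, 32, 48)).shape)                       # torch.Size([1, 32, 48])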

Wed 22 Oct. 14:15 - 16:15 PDT

#107
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

shiduo zhang · Zhe Xu · Peiju Liu · Xiaopeng Yu · Qinghui Gao · Yuan Li · Zhaoye Fei · Zhangyue Yin · Zuxuan Wu · Yu-Gang Jiang · Xipeng Qiu

General-purpose embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each task category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh & texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning, etc. To support downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#108
TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning

Yibing Wei · Samuel Church · Victor Suciu · Jinhong Lin · Cheng-En Wu · Pedro Morgado

Video data inherently captures rich, dynamic contexts that reveal objects in varying poses, interactions, and state transitions, offering rich potential for unsupervised visual representation learning. However, existing natural video datasets are not well-suited for effective object representation learning due to their lack of object-centricity and class diversity. To address these challenges, we introduce TrackVerse, a novel large-scale video dataset for learning object representations. TrackVerse features diverse, common objects tracked over time, capturing their evolving states. To leverage temporal dynamics in TrackVerse, we extend contrastive learning with a variance-aware predictor that conditions on data augmentations, enabling models to learn state-aware representations. Extensive experiments demonstrate that representations learned from TrackVerse with variance-aware contrastive learning significantly outperform those from non-object-centric natural video and static image datasets across multiple downstream tasks, including object/attribute recognition, action recognition, and video instance segmentation, highlighting the rich semantic and state content in TrackVerse features.

Wed 22 Oct. 14:15 - 16:15 PDT

#109
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Hyeonho Jeong · Suhyeon Lee · Jong Ye

We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Anonymous project page: https://anony1anony2.github.io/

Wed 22 Oct. 14:15 - 16:15 PDT

#110
Human-Object Interaction from Human-Level Instructions

Zhen Wu · Jiaman Li · Pei Xu · Karen Liu

Intelligent agents must autonomously interact with the environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its significant potential for real-world applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#111
KinMo: Kinematic-aware Human Motion Understanding and Generation

Pengfei Zhang · Pinxin Liu · Pablo Garrido · Hyeongwoo Kim · Bindita Chaudhuri

Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as "run", fails to capture essential details like variations in speed, limb positioning, and kinematic dynamics, leading to significant ambiguities between text and motion modalities. To address this challenge, we introduce \textbf{KinMo}, a unified framework built on a hierarchical describable motion representation that extends beyond global action by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment, improving spatial understanding by integrating additional motion details. Furthermore, we introduce a coarse-to-fine generation procedure to demonstrate how enhanced spatial understanding benefits motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance and enabling more fine-grained motion generation and editing capabilities.

Wed 22 Oct. 14:15 - 16:15 PDT

#112
Highlight
Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Taehoon Kim · Jongwook Choi · Yonghyun Jeong · Haeun Noh · Jaejun Yoo · Seungryul Baek · Jongwon Choi

We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. The traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect pixel-wise temporal artifacts. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
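The core operation here is simple enough to sketch. Below is a minimal, illustrative example of extracting a per-pixel temporal frequency representation with a 1D Fourier transform along the time axis, assuming a grayscale clip tensor of shape (T, H, W); the attention proposal and joint transformer modules are not shown, and the function name and preprocessing are assumptions rather than the paper's implementation.

```python
import torch

def pixelwise_temporal_spectrum(video: torch.Tensor) -> torch.Tensor:
    """Per-pixel temporal frequency representation.

    video: (T, H, W) grayscale clip with values in [0, 1].
    Returns: (T // 2 + 1, H, W) one-sided magnitude spectrum along time.
    """
    # Remove the per-pixel temporal mean so the DC component does not dominate.
    video = video - video.mean(dim=0, keepdim=True)
    # 1D Fourier transform over time, computed independently for every pixel.
    spectrum = torch.fft.rfft(video, dim=0)
    return spectrum.abs()

# Example: a 32-frame 64x64 clip yields a (17, 64, 64) temporal spectrum.
feat = pixelwise_temporal_spectrum(torch.rand(32, 64, 64))
```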

Wed 22 Oct. 14:15 - 16:15 PDT

#113
Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Lei-lei Li · Jianwu Fang · Junbin Xiao · Shanmin Pang · Hongkai Yu · Chen Lv · Jianru Xue · Tat-Seng Chua

Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate testing the capability to respond to accidents that would be unaffordable to reproduce in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model Causal-VidSyn for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct DriveGaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video content editing, normal-to-accident video diffusion, and text-to-video generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#114
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection

Haodong Zhu · Wenhao Dong · Linlin Yang · Hong Li · Yuguang Yang · Yangyang Ren · Qingcheng Zhu · Zichao Feng · CHANGBI LI · Shaohui Lin · Runqi Wang · Xiaoyan Luo · Baochang Zhang

Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an "absolute maximum" fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
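The "absolute maximum" fusion of high-frequency sub-bands is straightforward to illustrate. The sketch below assumes the DWT high-frequency sub-bands of the RGB and IR branches are already computed; the low-frequency Mamba fusion is not sketched, and the function name is hypothetical.

```python
import torch

def absolute_maximum_fusion(hf_rgb: torch.Tensor, hf_ir: torch.Tensor) -> torch.Tensor:
    """Keep, at every position, the high-frequency wavelet coefficient with
    the larger magnitude. Both inputs are (B, C, H, W) sub-bands, e.g. the
    LH/HL/HH bands produced by a DWT of the RGB and IR feature maps."""
    return torch.where(hf_rgb.abs() >= hf_ir.abs(), hf_rgb, hf_ir)
```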

Wed 22 Oct. 14:15 - 16:15 PDT

#115
Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior

Qing Ma · Pengwei Liang · Xiong Zhou · Jiayi Ma · Junjun Jiang · Zhe Peng

Gaussian denoising often serves as the starting point of research in the field of image denoising, owing to its prevalence and intriguing properties. However, deep Gaussian denoisers typically generalize poorly to other types of noise, such as Poisson noise and real-world noise. In this paper, we reveal that deep Gaussian denoisers have an underlying ability to handle other noises with only ten iterations of self-supervised learning, which is referred to as \textit{deep denoiser prior}. Specifically, we first pre-train a Gaussian denoising model in a self-supervised manner. Then, for each test image, we construct a pixel bank based on the self-similarity and randomly sample pseudo-instance examples from it to perform test-time adaptation. Finally, we fine-tune the pre-trained Gaussian denoiser using the randomly sampled pseudo-instances. Extensive experiments demonstrate that our test-time adaptation method helps the pre-trained Gaussian denoiser rapidly improve performance in removing both in-distribution and out-of-distribution noise, achieving superior performance compared to existing single-image denoising methods while also significantly reducing computational time.
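The test-time adaptation loop lends itself to a short sketch. The version below is a generic ten-step fine-tuning loop under the assumption that `sample_pseudo_instance` is a hypothetical stand-in for drawing (pseudo-noisy, pseudo-target) pairs from the self-similarity-based pixel bank, whose construction is not shown here.

```python
import copy
import torch
import torch.nn.functional as F

def test_time_adapt(denoiser, noisy_image, sample_pseudo_instance,
                    steps: int = 10, lr: float = 1e-4):
    """Generic sketch of a ten-iteration test-time adaptation loop."""
    model = copy.deepcopy(denoiser)   # adapt a fresh copy for each test image
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        pseudo_noisy, pseudo_target = sample_pseudo_instance(noisy_image)
        loss = F.mse_loss(model(pseudo_noisy), pseudo_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(noisy_image)     # denoise with the adapted model
```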

Wed 22 Oct. 14:15 - 16:15 PDT

#116
MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

Hyung Kyu Kim · Sangmin Lee · HAK GU KIM

Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker’s speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose \textit{MemoryTalker} which enables realistic and accurate 3D facial motion synthesis by reflecting speaker style only with audio input to maximize usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (\textit{i.e.}, Memorizing), and the second stage performs personalized facial motion synthesis (\textit{i.e.}, Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns about which facial motion types should be emphasized for a particular piece of audio. As a result, our \textit{MemoryTalker} can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods. Our source code will be released to facilitate further research.

Wed 22 Oct. 14:15 - 16:15 PDT

#117
Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition

Haochen Chang · Pengfei Ren · Haoyang Zhang · Liang Xie · Hongbo Chen · Erwei Yin

In recent years, skeleton-based action recognition has gained significant attention due to its robustness in varying environmental conditions. However, most existing methods struggle to distinguish fine-grained actions due to subtle motion differences and minimal inter-class variation, and they often fail to consider the underlying similarity relationships between action classes. To address these limitations, we propose a Hierarchical-aware Orthogonal Disentanglement framework (HiOD). We disentangle coarse-grained and fine-grained features by employing independent spatial-temporal granularity-aware bases, which encode semantic representations at varying levels of granularity. Additionally, we design a cross-granularity feature interaction mechanism that leverages complementary information between coarse-grained and fine-grained features. We further enhance the learning process through hierarchical prototype contrastive learning, which utilizes the parent class hierarchy to guide the learning of coarse-grained features while ensuring the distinguishability of fine-grained features within child classes. Extensive experiments on FineGYM, FSD-10, NTU RGB+D, and NTU RGB+D 120 datasets demonstrate the superiority of our method in fine-grained action recognition tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#118
Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

Xingyu Hu · Junjun Jiang · Chenyang Wang · Kui Jiang · Xianming Liu · Jiayi Ma

Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model's generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named "TITA", which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#119
EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation

Jong Hyeon Baek · Jiwon oh · Yeong Jun Koh

Video Object Segmentation (VOS) in low-light scenarios remains highly challenging due to significant texture loss and severe noise, which often lead to unreliable image feature generation and degraded segmentation performance. To address this issue, we propose EVOLVE, a novel event-guided deformable feature transfer and dual-memory refinement framework for low-light VOS. EVOLVE addresses spatial misalignment between frames and improves object representation by utilizing event-driven cues. The event-guided deformable feature transfer (EDFT) module enhances feature alignment through event-driven deformable convolutions, where offsets derived from event features enable motion-aware spatial adjustments, leading to more precise propagation of object features in reference frames. Furthermore, the dual-memory object transformer (DMOT) iteratively refines object representations by maintaining and updating both image-based and event-based memory representations. Through its memory refinement module (MRM), DMOT selectively enhances relevant object features while suppressing background noise, resulting in stable and temporally coherent segmentation results. Extensive experiments on low-light VOS benchmarks demonstrate that EVOLVE achieves state-of-the-art segmentation performance, surpassing both event-based and image-based VOS methods in accuracy and computational efficiency.
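The event-guided deformable alignment idea can be approximated with torchvision's standard deformable convolution, with offsets predicted from event features. The sketch below is illustrative only: channel sizes, the offset predictor, and the class name are assumptions rather than the paper's EDFT design, and it assumes image and event features share the same spatial resolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class EventGuidedDeformableTransfer(nn.Module):
    """Warp image features with offsets derived from event features."""

    def __init__(self, img_ch: int, evt_ch: int, k: int = 3):
        super().__init__()
        # Two offsets (dx, dy) for each of the k*k kernel locations.
        self.offset_pred = nn.Conv2d(evt_ch, 2 * k * k, kernel_size=3, padding=1)
        self.deform = DeformConv2d(img_ch, img_ch, kernel_size=k, padding=k // 2)

    def forward(self, img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, img_ch, H, W); evt_feat: (B, evt_ch, H, W).
        offsets = self.offset_pred(evt_feat)      # (B, 2*k*k, H, W)
        return self.deform(img_feat, offsets)     # motion-aware aligned features
```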

Wed 22 Oct. 14:15 - 16:15 PDT

#120
PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Yong Liu · Hang Dong · Jinshan Pan · Qingji dong · Kai Chen · Rongxiang Zhang · Lean Fu · Fei Wang

While diffusion models significantly improve the perceptual quality of super-resolved images, they usually require a large number of sampling steps, resulting in high computational costs and long inference times. Recent efforts have explored reasonable acceleration schemes by reducing the number of sampling steps. However, these approaches treat all regions of the image equally, overlooking the fact that regions with varying levels of reconstruction difficulty require different sampling steps. To address this limitation, we propose PatchScaler, an efficient patch-independent diffusion pipeline for single image super-resolution. Specifically, PatchScaler introduces a Patch-adaptive Group Sampling (PGS) strategy that groups feature patches by quantifying their reconstruction difficulty and establishes shortcut paths with different sampling configurations for each group. To further optimize the patch-level reconstruction process of PGS, we propose a texture prompt that provides rich texture conditional information to the diffusion model. The texture prompt adaptively retrieves texture priors for the target patch from a common reference texture memory. Extensive experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations, while significantly speeding up inference.
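The patch-adaptive grouping can be illustrated in a few lines: patches are binned by a per-patch difficulty score and assigned different step budgets. The difficulty estimator, number of groups, and step budgets below are placeholders, not the paper's configuration.

```python
import torch

def group_patches_by_difficulty(difficulty: torch.Tensor,
                                steps_per_group=(2, 4, 8)) -> torch.Tensor:
    """Assign more diffusion steps to patches that are harder to reconstruct.

    difficulty: (N,) float scores, higher = harder (estimator not shown).
    Returns: (N,) number of sampling steps per patch.
    """
    # Split the score range into as many quantile bins as there are step budgets.
    q = torch.linspace(0, 1, len(steps_per_group) + 1)
    edges = torch.quantile(difficulty, q)
    groups = torch.bucketize(difficulty, edges[1:-1])   # bin index per patch
    return torch.tensor(steps_per_group)[groups]
```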

Wed 22 Oct. 14:15 - 16:15 PDT

#121
OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars

Jinshu Chen · Bingchuan Li · Fan Zhang · Songtao Zhao · Qian HE

Existing solutions for creating high-fidelity digital head avatars encounter various obstacles. Traditional rendering tools offer realistic results but heavily rely on expert skills, while neural rendering methods are more efficient yet often compromise between the generated fidelity and flexibility. We present OneGT that, for the first time, adheres to the frameworks of the rendering tools, while restructuring individual stages of the rendering pipeline through neural networks. OneGT maintains high systemic interpretability, inheriting the superior performances of neural rendering approaches. Specifically, OneGT contains a skeleton-anchoring stage and a texture-rendering stage, in which well-designed Transformers learn the geometric transformations and the proposed reference-perceptible DiT renders the textures respectively. Our framework learns geometric consistency from the innovatively introduced synthetic data, thus achieving superior performance while requiring only 10%-30% of the real-world data typically used by competitive methods. Experimental results demonstrate that OneGT achieves high fidelity in producing portrait avatars, meanwhile maintaining the flexibility of editing.

Wed 22 Oct. 14:15 - 16:15 PDT

#122
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

Xinyao Liao · Xianfang Zeng · Liao Wang · Gang YU · Guosheng Lin · Chi Zhang

We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text, and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. After that, an optional rethinking step can be adopted to ensure the generated video is aligned well with motion information in the prompt. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We further construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.

Wed 22 Oct. 14:15 - 16:15 PDT

#123
D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Wenjie Pei · Qizhong Tan · Guangming Lu · Jiandong Tian · Jun Yu

Adapting pre-trained image models to video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning in image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter). It features a lightweight design, low adaptation overhead and powerful spatio-temporal feature adaptation capabilities. D$^2$ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features within the adapter, seamlessly integrating into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of the D$^2$ST-Adapter, incorporating the specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be individually tailored for two pathways with anisotropic sampling densities along the spatial and temporal domains in 3D spatio-temporal space, enabling disentangled encoding of spatial and temporal features while maintaining a lightweight design. Extensive experiments by instantiating our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition. Code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#124
Highlight
Disentangled Clothed Avatar Generation with Layered Representation

Weitian Zhang · Yichao Yan · Sijing Wu · Manwen Liao · Xiaokang Yang

Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. While existing methods have made progress in creating animatable digital avatars, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, a novel feed-forward diffusion-based method capable of generating high-quality component-disentangled clothed avatars in seconds. We propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation can be effectively learned with current feed-forward generation pipelines, facilitating component disentanglement and enhancing details of generated avatars. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to mitigate the severe occlusion issue of the innermost human body layer. Extensive experiments demonstrate the superior performances of our method in generating highly detailed and disentangled clothed avatars. In addition, we explore its applications in component transfer.

Wed 22 Oct. 14:15 - 16:15 PDT

#125
Augmented Mass-Spring Model for Real-Time Dense Hair Simulation

Jorge Herrera · Yi Zhou · Xin Sun · Zhixin Shu · Chengan He · Soren Pirk · Dominik Michels

We propose a novel Augmented Mass-Spring (AMS) model for real-time simulation of dense hair at the strand level. Our approach considers the traditional edge, bending, and torsional degrees of freedom in mass-spring systems, but incorporates an additional one-way biphasic coupling with a ghost rest-shape configuration. Through multiple evaluation experiments with varied dynamical settings, we show that AMS improves the stability of the simulation in comparison to mass-spring discretizations, preserves global features, and enables the simulation of non-Hookean effects. Using a heptadiagonal decomposition of the resulting matrix, our approach provides the efficiency advantages of mass-spring systems over more complex constitutive hair models, while enabling a more robust simulation of multiple strand configurations. Finally, our results demonstrate that our framework enables the generation, complex interactivity, and editing of simulation-ready dense hair assets in real time.

Wed 22 Oct. 14:15 - 16:15 PDT

#126
Punching Bag vs. Punching Person: Motion Transferability in Videos

Raiyaan Abdullah · Jared Claypoole · Michael Cogswell · Ajay Divakaran · Yogesh Rawat

Action recognition models, both unimodal and multimodal, have demonstrated strong generalization in tasks such as zero-shot learning, base-to-novel transfer, and domain adaptation. However, can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "Pushing" when presented with unknown variations such as "Pushing something from right to left"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. Our study establishes a crucial benchmark for assessing motion transferability in action recognition.

Wed 22 Oct. 14:15 - 16:15 PDT

#127
OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

YeonJi Song · Jaein Kim · Suhyung Choi · Jin-Hwa Kim · Byoung-Tak Zhang

Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its potential for applicability in vision-related dynamics learning tasks.
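The kinematic attributes the abstract refers to (position, velocity, acceleration) can be derived from tracked object positions with finite differences, as in the small sketch below; the tensor layout and the zero-padding of the first frame are illustrative choices, not the paper's exact formulation.

```python
import torch

def object_kinematics(positions: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
    """Concatenate position, velocity, and acceleration per object.

    positions: (T, N, 2) xy centroids of N objects over T frames.
    Returns: (T, N, 6) kinematic attributes; the first frame's velocity
    and acceleration are zero-padded.
    """
    velocity = torch.zeros_like(positions)
    velocity[1:] = (positions[1:] - positions[:-1]) / dt        # forward difference
    acceleration = torch.zeros_like(positions)
    acceleration[1:] = (velocity[1:] - velocity[:-1]) / dt
    return torch.cat([positions, velocity, acceleration], dim=-1)
```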

Wed 22 Oct. 14:15 - 16:15 PDT

#128
FaceXFormer: A Unified Transformer for Facial Analysis

Kartik Narayan · Vibashan VS · Rama Chellappa · Vishal Patel

In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. These tasks include face parsing, landmark detection, head pose estimation, attribute prediction, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility. Traditional face analysis approaches rely on task-specific architectures and pre-processing techniques, limiting scalability and integration. In contrast, FaceXFormer employs a transformer-based encoder-decoder architecture, where each task is represented as a learnable token, enabling seamless multi-task processing within a unified model. To enhance efficiency, we introduce FaceX, a lightweight decoder with a novel bi-directional cross-attention mechanism, which jointly processes face and task tokens to learn robust and generalized facial representations. We train FaceXFormer on ten diverse face perception datasets and evaluate it against both specialized and multi-task models across multiple benchmarks, demonstrating state-of-the-art or competitive performance. Additionally, we analyze the impact of various components of FaceXFormer on performance, assess real-world robustness in "in-the-wild" settings, and conduct a computational performance evaluation. To the best of our knowledge, FaceXFormer is the first model capable of handling ten facial analysis tasks while maintaining real-time performance at $33.21$ FPS. Code and models will be released post-review.

Wed 22 Oct. 14:15 - 16:15 PDT

#129
ContextFace: Generating Facial Expressions from Emotional Contexts

minjung kim · Minsang Kim · Seung Jun Baek

The task of generating 3D facial expressions given various situational contexts is important for applications such as virtual avatars or human-robot interactions. The task is, however, challenging not only because it requires a comprehensive understanding of emotion, expression and contexts, but also there rarely are datasets to support the task. We propose ContextFace, a Multi-modal Large Language Model (MLLM) fine-tuned to generate 3D facial expressions depending on complex situational contexts. To overcome the lack of datasets, we perform a context augmentation to existing emotion recognition datasets; we generate plausible situations and quotes from images and emotions to annotate the dataset. Next, we perform visual instruction tuning of MLLMs on context-augmented datasets to boost its capability of visual synthesis from emotions. Experiments show a superior performance of ContextFace in the zero-shot evaluation of contextual emotion recognition. A qualitative evaluation shows that our method generates expressions consistent with diverse contexts and performs complex emotion reasoning, e.g., speculative generation of expressions of occluded faces through interactive prompting.

Wed 22 Oct. 14:15 - 16:15 PDT

#130
Laboring on less labors: RPCA Paradigm for Pan-sharpening

honghui xu · Chuangjie Fang · Yibin Wang · Jie Wu · Jianwei Zheng

Deep unfolding network (DUN) based pansharpening has shed new light on high-resolution/spectrum image acquisition, serving as a computational alternative to physical devices. While it enjoys both the merits of deep feature learning and acceptable interpretability, current pansharpening necessitates substantial effort in approximating the degradation matrices along the spatial and spectral dimensions, yet performance is hardly guaranteed in complex scenarios. Moreover, as a key step during DUN update, current solutions rely solely on black-box networks to learn the data-driven priors, which further results in laborious architecture crafting and compromised interpretability. To counteract these dilemmas, we propose a new solution, namely \textbf{R}PCA-based \textbf{U}nfolding \textbf{N}etwork (RUN), which shrinks the original two degradations to only one. Specifically, grounded in the significant sparsity of spatial offset components, \textit{i.e.}, the difference between the upsampled image and the desired target, we shift the original pansharpening issue into a novel Robust Principal Component Analysis (RPCA)-based paradigm. On that basis, the tricky approximation to the spatial degradation matrix as well as its transposed counterpart is naturally avoided. Specific to the prior learning step of RPCA unfolding, an efficient Nonlinear transformation-based Tensor Nuclear Norm (NTNN) is meticulously engineered, in which the computationally intensive Singular Value Decomposition is avoided with the aid of depthwise convolutions. More importantly, NTNN plays a plug-and-play role and can be easily embedded into Transformer/CNN architectures for the learning of both global and local features. Experimental results on multiple remote sensing datasets demonstrate the superiority of the proposed method over previous SOTA methods. Representatively, with two formerly indispensable degradations omitted, a 0.899 dB PSNR gain can still be achieved on the GF2 dataset.
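For readers unfamiliar with the paradigm, the classical RPCA objective that such an unfolding builds on decomposes the observation $X$ into a low-rank part $L$ and a sparse part $S$ (textbook form, not necessarily the paper's exact model):

$$\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad X = L + S$$

where $\|\cdot\|_{*}$ is the nuclear norm and $\lambda$ balances the two terms.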

Wed 22 Oct. 14:15 - 16:15 PDT

#131
ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer

Jin Hu · Mingjia Li · Xiaojie Guo

Shadows introduce challenges such as reduced brightness, texture deterioration, and color distortion in images, complicating a holistic solution. This study presents ShadowHack, a divide-and-conquer strategy that tackles these complexities by decomposing the original task into luminance recovery and color remedy. To brighten shadow regions and repair the corrupted textures in the luminance space, we customize LRNet, a U-shaped network with a rectified outreach attention module, to enhance information interaction and recalibrate contaminated attention maps. With luminance recovered, CRNet then leverages cross-attention mechanisms to revive vibrant colors, producing visually compelling results. Extensive experiments on multiple datasets are conducted to demonstrate the superiority of ShadowHack over existing state-of-the-art solutions both quantitatively and qualitatively, highlighting the effectiveness of our design. Our code will be made publicly available.
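The luminance-color split can be illustrated with a standard YCbCr (BT.601) conversion, where one branch would receive the luminance channel and the other the chroma channels; the paper's actual color decomposition may differ, and the function name is hypothetical.

```python
import torch

def split_luminance_chroma(rgb: torch.Tensor):
    """Split a (B, 3, H, W) RGB image in [0, 1] into luminance and chroma
    using the BT.601 YCbCr transform (illustrative stand-in only)."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y  = 0.299 * r + 0.587 * g + 0.114 * b                      # luminance
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b            # blue-difference chroma
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b            # red-difference chroma
    return y, torch.cat([cb, cr], dim=1)
```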

Wed 22 Oct. 14:15 - 16:15 PDT

#132
What we need is explicit controllability: Training 3D gaze estimator using only facial images

Tingwei Li · Jun Bao · Zhenzhong Kuang · Buyu Liu

This work focuses on unsupervised 3D gaze estimation. Specifically, we adopt a learning-by-synthesis approach, where a gaze prediction model is trained using simulated data. Unlike existing methods that lack explicit and accurate control over facial images—particularly the eye regions—we propose a geometrically meaningful 3D representation that enables diverse, precise, and explicit control over illumination, eye regions, and gaze targets using only facial images. Given a sequence of facial images, our method constructs a mesh representation where each mesh is associated with 3D Gaussians, allowing for explicit lighting control. To further enhance realism, we introduce eye-focused constraints, including a rotation symmetry protocol, as well as geometry and appearance losses for the eye regions, alongside conventional learning objectives. Additionally, we incorporate a virtual screen target and rotate the eyeballs accordingly, generating more accurate pseudo gaze directions paired with realistic facial images. We validate our approach through extensive experiments on three benchmarks. The results demonstrate that gaze estimators trained using our method outperform all unsupervised baselines and achieve performance comparable to cross-dataset approaches. Furthermore, our method generates the most visually realistic images, as confirmed by both objective and subjective image quality metrics.

Wed 22 Oct. 14:15 - 16:15 PDT

#133
Highlight
Riemannian-Geometric Fingerprints of Generative Models

Hae Jin Song · Laurent Itti

Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet a gap still exists in understanding generative models' fingerprints, which we believe stems from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of generative models using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al, 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and $k$NN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of generative models, spanning across 4 different datasets in 2 different resolutions (64x64, 256x256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition can significantly improve the performance on model attribution, as well as generalization to unseen datasets, model types, and modalities, suggesting its efficacy in practice.
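As a point of reference, the $k$NN-based Riemannian center of mass mentioned here is the standard Karcher/Fréchet mean, which replaces the Euclidean average with a geodesic-distance minimizer over the manifold $\mathcal{M}$ (general definition, not the paper's exact estimator):

$$\bar{x} \;=\; \arg\min_{x \in \mathcal{M}} \sum_{i=1}^{k} d_g\!\left(x, x_i\right)^{2}$$

where $d_g$ is the learned Riemannian (geodesic) distance and $x_1, \dots, x_k$ are the $k$ nearest neighbors.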

Wed 22 Oct. 14:15 - 16:15 PDT

#134
Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

Yingyu Liang · Zhizhou Sha · Zhenmei Shi · Zhao Song · Mingda Wan · Yufa Zhou

Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. We prove that if the target distribution is a $k$-mixture of Gaussians, the density of the entire diffusion process will also be a $k$-mixture of Gaussians. We then derive tight upper bounds on the Lipschitz constant and second momentum that are independent of the number of mixture components $k$. Finally, we apply our analysis to various diffusion solvers, both SDE and ODE based, to establish concrete error guarantees in terms of the total variation distance and KL divergence between the target and learned distributions. Furthermore, our preliminary experiments support our theoretical analysis. Our results provide deeper theoretical insights into the dynamics of the diffusion process under common data distributions.

Wed 22 Oct. 14:15 - 16:15 PDT

#135
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

Juntao Jian · Xiuping Liu · Zixuanchen Zixuanchen · Manyi Li · Jian Liu · Ruizhen Hu

Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. But it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution plays as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate the remarkable performance against the existing approaches.

Wed 22 Oct. 14:15 - 16:15 PDT

#136
Highlight
FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching

Hui Li · Xiaoyu Ren · Hongjiu Yu · Ying Chen · Kai Li · L Wang · Xiongkuo Min · Huiyu Duan · Guangtao Zhai · Xu Liu

Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live videos with facial retouching. However, previous FAP datasets are either small or closed-source. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, we introduce the first large-scale FAP dataset LiveBeauty, specifically designed for live video scenarios wherein face images may be processed in real time for aesthetic purposes. 10,000 face images are collected directly from a live streaming platform, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset. Based on the built dataset, a novel FAP method named Facial Prior Enhanced Multi-modal model (FPEM) is proposed to measure the attractiveness of facial images. Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. The dataset will be available soon.

Wed 22 Oct. 14:15 - 16:15 PDT

#137
Diffusion-Based Imaginative Coordination for Bimanual Manipulation

Huilin Xu · Jian Ding · Jiakun Xu · Ruixiang Wang · Jun Chen · Jinjie Mai · Yanwei Fu · Bernard Ghanem · Feng Xu · Mohamed Elhoseiny

Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, but action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments.
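The unidirectional attention design can be sketched as a boolean mask in which video tokens may attend to action tokens but not vice versa, so action prediction stays independent of video prediction and the video branch can be dropped at inference. Token ordering and the True-means-blocked convention below are assumptions (matching PyTorch's boolean attn_mask semantics).

```python
import torch

def unidirectional_attention_mask(num_action: int, num_video: int) -> torch.Tensor:
    """Boolean attention mask (True = attention blocked).

    Action tokens come first, video tokens second. Video queries may attend
    to action keys, but action queries cannot attend to video keys.
    """
    n = num_action + num_video
    mask = torch.zeros(n, n, dtype=torch.bool)   # rows = queries, cols = keys
    mask[:num_action, num_action:] = True        # block action -> video attention
    return mask

# Example: 8 action tokens and 16 video tokens give a (24, 24) mask usable
# as attn_mask in torch.nn.MultiheadAttention.
mask = unidirectional_attention_mask(8, 16)
```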

Wed 22 Oct. 14:15 - 16:15 PDT

#138
WarpHE4D: Dense 4D Head Map toward Full Head Reconstruction

Jongseob Yun · Yong-Hoon Kwon · Min-Gyu Park · Ju-Mi Kang · Min-Ho Lee · Inho Chang · Ju Yoon · Kuk-Jin Yoon

We address the 3D head reconstruction problem and the facial correspondence search problem in a unified framework, named as $\textbf{WarpHE4D}$. The underlying idea is to establish correspondences between the facial image and the fixed UV texture map by exploiting powerful self-supervised visual representations, $\textit{i.e.}$, DINOv2. In other words, we predict UV coordinates for each pixel that maps the pixel to a point in the UV map. At the same time, we predict the nose-centered depth map leveraged by the facial correspondences. Note that our framework does not require fitting a template model, $\text{e.g.,}$ 3DMM, to the image, which directly regresses 4D vectors for each pixel. The experimental results show that our approach not only improves the accuracy of head geometry but also significantly improves the robustness under pose or viewpoint variations, particularly when the head is rotated more than 90 degrees. We believe our method can be a groundwork for photorealistic head avatar generation, even in uncalibrated camera settings.

Wed 22 Oct. 14:15 - 16:15 PDT

#139
PrimHOI: Compositional Human-Object Interaction via Reusable Primitives

Kai Jia · Tengyu Liu · Mingtao Pei · Yixin Zhu · Siyuan Huang

Synthesizing complex and diverse human-object interactions (HOI) based on minimal instructions is crucial for advancing character animation and embodied AI. Existing approaches primarily rely on data-intensive learning models, which struggle to replicate the nuanced, compositional structure of daily HOI motions. In this paper, we propose a novel framework that leverages a generalizable representation of HOI primitives defined by relative geometry. Our approach uses an object-centric hierarchical planning process, integrating high-level planning, key pose generation, and intermediate motion synthesis to construct realistic HOI sequences achieving novel tasks. Key poses, defined by reusable contact mode primitives, serve as flexible constraints that guide the synthesis of intricate interaction motions through a symbolic planner. Our system generates intermediate motions by first planning object trajectories with collision avoidance, followed by object-motion-guided human motion generation. To ensure coherence and realism, we apply a post-optimization process that aligns motions with planned constraints, resulting in high-quality interaction sequences. Our framework supports zero-shot transfer, enabling the synthesis of novel HOI motions without specific training examples. Experimental results demonstrate that our approach significantly enhances the adaptability, diversity, and quality of synthesized interactions, marking a meaningful step forward in flexible HOI motion generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#140
Continuous-Time Human Motion Field from Event Cameras

Ziyun (Claude) Wang · Ruijun Zhang · Zi-Yan Liu · Yufu Wang · Kostas Daniilidis

This paper addresses the challenges of estimating a continuous-time field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field from events caused by human motion. Prior state-of-the-art methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, our model leverages a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. We present the first work that replaces traditional event volume-based discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. To advance the evaluation of continuous-time human pose estimation, we introduce the Beam-splitter Event Agile Human Motion Dataset—a hardware-synchronized high-speed human dataset tailored for this purpose. EvHuman improves joint errors by 23.8% compared to previous event-based human methods, while reducing the computational time by 69%.
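One way to picture a time-implicit motion field is a small MLP conditioned on a per-sequence latent code and queried at arbitrary continuous times, which allows pose queries in parallel. The sketch below is a rough illustration under assumed layer sizes, pose dimensionality, and conditioning; it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class TimeImplicitMotionField(nn.Module):
    """Map (sequence latent, query time) -> pose parameters."""

    def __init__(self, latent_dim: int = 256, pose_dim: int = 72):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),
        )

    def forward(self, latent: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # latent: (B, latent_dim) sequence code; t: (B, Q) query times in [0, 1].
        B, Q = t.shape
        latent = latent.unsqueeze(1).expand(B, Q, -1)
        x = torch.cat([latent, t.unsqueeze(-1)], dim=-1)
        return self.mlp(x)            # (B, Q, pose_dim), all queries in parallel
```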

Wed 22 Oct. 14:15 - 16:15 PDT

#141
Efficient Track Anything

Yunyang Xiong · Chong Zhou · Xiaoyu Xiang · Lemeng Wu · Chenchen Zhu · Zechun Liu · Saksham Suri · Balakrishnan Varadarajan · Ramya Akula · Forrest Iandola · Raghuraman Krishnamoorthi · Bilge Soran · Vikas Chandra

Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. Our idea is based on adopting lightweight Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with lightweight ViT performs comparably to SAM 2 model (SAM 2-HieraB+) with ~1.6x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAM can run at ~28 FPS for near real-time video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#142
HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID

Yiyang Su · Yunping Shi · Feng Liu · Xiaoming Liu

Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this challenge, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-scale features from a pre-trained large model (\emph{e.g.}, CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features—appearance, static body shape, and dynamic gait—and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-scale representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (+11.0% Rank-1).
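A dual-input decision gate of the kind described above can be sketched as a small MLP over concatenated query and gallery descriptors that produces softmax weights for the expert outputs; the layer sizes, number of experts, and gating form are illustrative assumptions, not the exact HAMoBE design.

```python
import torch
import torch.nn as nn

class DualInputGate(nn.Module):
    """Weight biometric experts from a query-gallery descriptor pair."""

    def __init__(self, feat_dim: int, num_experts: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_experts),
        )

    def forward(self, query_feat, gallery_feat, expert_outputs):
        # query_feat, gallery_feat: (B, feat_dim); expert_outputs: (B, E, D).
        gate = self.mlp(torch.cat([query_feat, gallery_feat], dim=-1))
        weights = gate.softmax(dim=-1)                              # (B, E)
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)  # (B, D)
```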

Wed 22 Oct. 14:15 - 16:15 PDT

#143
Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

Jingyu Liu · Zijie Xin · Yuhan Fu · Ruixiang Zhao · Bangxiang Lan · Xirong Li

Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current sketch animation methods perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we summarize two challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch based on iterative optimization through Score Distillation Sampling (SDS), without any other data for training. We propose four modules: LLM-based scene decomposition, LLM-based motion planning, motion refinement network and compositional SDS, to tackle the two challenges in a divide-and-conquer strategy. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications. The code will be released.
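For context, the standard Score Distillation Sampling gradient that the compositional SDS objective builds on has the form

$$\nabla_{\theta} \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\,\epsilon}\!\left[ w(t)\,\bigl(\epsilon_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\, \frac{\partial x}{\partial \theta} \right]$$

where $x = g(\theta)$ is the rendered sketch, $\epsilon_{\phi}$ the frozen diffusion model, $y$ the text condition, and $w(t)$ a timestep weighting; how MoSketch composes this across decomposed objects is not shown here.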

Wed 22 Oct. 14:15 - 16:15 PDT

#144
Highlight
ISP2HRNet: Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels via Hierarchical Gradient Learning

Yuanlin Wang · Ruiqin Xiong · Rui Zhao · Jin Wang · Xiaopeng Fan · Tiejun Huang

While image signals are typically defined on a regular 2D grid, there are scenarios where they are only available at irregular positions. In such cases, reconstructing a complete image on regular grid is essential. This paper introduces ISP2HRNet, an end-to-end network designed to reconstruct high resolution image from irregularly sampled pixels that do not fall on a regular grid. To handle the challenges brought by irregular sampling, we propose an architecture to extract gradient structure hierarchically and learn continuous image representation. Specifically, we derive image gradient for each irregularly sampled pixel and further learn higher order gradient structural features according to the geometric and photometric information at the vertices of neighboring triangles. To convert the features from irregular pixels to regular grid, we propose a dual branch content-dependent weight generator to adaptively fuse the information from neighboring irregular pixels. Subsequently, an encoder captures deep structural details on regular grid and forms latent codes. Implicit neural representation parameterized by multi-layer perceptron decodes the latent codes and coordinates to pixel values for generating high resolution image. Experimental results demonstrate that the proposed network can effectively solve the problem of high resolution image reconstruction from irregularly sampled pixels and achieve promising results. The code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#145
LDIP: Long Distance Information Propagation for Video Super-Resolution

Michael Bernasconi · Abdelaziz Djelouah · Yang Zhang · Markus Gross · Christopher Schroers

Video super-resolution (VSR) methods typically exploit information across multiple frames to achieve high quality upscaling, with recent approaches demonstrating impressive performance. Nevertheless, challenges remain, particularly in effectively leveraging information over long distances. To address this limitation in VSR, we propose a strategy for long distance information propagation with a flexible fusion module that can optionally also assimilate information from additional high resolution reference images. We design our overall approach such that it can leverage existing pre-trained VSR backbones and adapt the feature upscaling module to support arbitrary scaling factors. Our experiments demonstrate that we can achieve state-of-the-art results on perceptual metrics and deliver more visually pleasing results compared to existing solutions.

Wed 22 Oct. 14:15 - 16:15 PDT

#146
MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation

Jungwoo Huh · Yeseung Park · Seongjean Kim · Jungsu Kim · Sanghoon Lee

Human motion estimation models typically assume a fixed number of input frames, making them sensitive to variations in frame rate and leading to inconsistent motion predictions across different temporal resolutions. This limitation arises because input frame rates inherently determine the temporal granularity of motion capture, causing discrepancies when models trained on a specific frame rate encounter different sampling frequencies. To address this challenge, we propose MBTI (Masked Blending Transformers with Implicit Positional Encoding), a frame rate-agnostic human motion estimation framework designed to maintain temporal consistency across varying input frame rates. Our approach leverages a masked autoencoder (MAE) architecture with masked token blending, which aligns input tokens with a predefined high-reference frame rate, ensuring a standardized temporal representation. Additionally, we introduce implicit positional encoding, which encodes absolute time information using neural implicit functions, enabling more natural motion reconstruction beyond discrete sequence indexing. By reconstructing motion at a high reference frame rate and optional downsampling, MBTI ensures both frame rate generalization and temporal consistency. To comprehensively evaluate MBTI, we introduce EMDB-FPS, an augmented benchmark designed to assess motion estimation robustness across multiple frame rates in both local and global motion estimation tasks. To further assess MBTI’s robustness, we introduce the Motion Consistency across Frame rates (MCF), a novel metric to quantify the deviation of motion predictions across different input frame rates. Our results demonstrate that MBTI outperforms state-of-the-art methods in both motion accuracy and temporal consistency, achieving the most stable and consistent motion predictions across varying frame rates.

Wed 22 Oct. 14:15 - 16:15 PDT

#147
Highlight
Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

Anja Delić · Matej Grcic · Siniša Šegvić

Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected in unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e., receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
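The anomaly score described above, a confidence-weighted sum of per-keypoint log-conditionals under predicted Gaussians, can be sketched as follows. The predictive model that produces the means and variances is abstracted away, and all shapes are illustrative.

```python
# Hedged sketch of a keypoint-level Gaussian anomaly score: each keypoint is
# scored by its log-density under a predicted Gaussian (given prior motion),
# weighted by the keypoint detector's confidence. Low score => anomalous pose.
import math
import torch

def anomaly_score(kpts, pred_mean, pred_logvar, conf):
    """kpts, pred_mean, pred_logvar: (K, 2); conf: (K,) detector confidences."""
    var = pred_logvar.exp()
    # Gaussian log-density per keypoint, summed over the x/y coordinates.
    log_p = -0.5 * (((kpts - pred_mean) ** 2) / var
                    + pred_logvar + math.log(2 * math.pi)).sum(dim=-1)
    # Confidence-weighted sum of per-keypoint log-conditionals.
    return (conf * log_p).sum()

K = 17  # e.g. COCO-style keypoints
score = anomaly_score(torch.randn(K, 2), torch.randn(K, 2),
                      torch.zeros(K, 2), torch.ones(K))
```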

Wed 22 Oct. 14:15 - 16:15 PDT

#148
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Korrawe Karunratanakul · Pu Wang · Hongfei Xue · Chen Chen · chuan guo · Junli Cao · Jian Ren · Sergey Tulyakov

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, a Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution to force the generated motion to accurately align with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to combat the non-differentiable distribution sampling process encountered by the logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualizations can be found at https://anonymous-ai-agent.github.io/CAM
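A minimal sketch of what a differentiable expectation over token sampling can look like, in the spirit of the DES component named above: a probability-weighted average of codebook embeddings replaces a hard categorical draw so that gradients from a control objective can reach the logits. The codebook size and dimensions are hypothetical.

```python
# Sketch of "expectation instead of sampling": rather than a non-differentiable
# argmax/categorical draw over motion tokens, take the probability-weighted
# expectation of the codebook embeddings so gradients flow back to the logits.
import torch

def expected_token_embedding(logits, codebook, temperature=1.0):
    """logits: (T, V) per-frame token logits; codebook: (V, D) embeddings."""
    probs = torch.softmax(logits / temperature, dim=-1)   # (T, V)
    return probs @ codebook                               # (T, D), differentiable

logits = torch.randn(16, 512, requires_grad=True)
codebook = torch.randn(512, 256)
motion_feat = expected_token_embedding(logits, codebook)
motion_feat.sum().backward()   # gradients reach the logits
```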

Wed 22 Oct. 14:15 - 16:15 PDT

#149
RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

Huiyang Hu · Peijin Wang · Hanbo Bi · Boyuan Tong · Zhaozhi Wang · Wenhui Diao · Hao Chang · Yingchao Feng · Ziqi Zhang · Yaowei Wang · Qixiang Ye · Kun Fu · Xian Sun

Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these challenges, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency-domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, RS-vHeat reduces memory usage by 84%, FLOPs by 24%, and improves throughput by 2.7 times. The code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#150
Highlight
GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu · Yiran Qin · Xintao Wang · Pengfei Wan · Di ZHANG · Xihui Liu

Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled, scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend the framework to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and code will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#151
MOSCATO: Predicting Multiple Object State Change Through Actions

Parnian Zameni · Yuhan Shen · Ehsan Elhamifar

We introduce MOSCATO: a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. While prior work in object state prediction has typically focused on a single object undergoing one or a few state changes, real-world tasks require tracking many objects whose states evolve over multiple actions. Given the high cost of gathering framewise object-state labels for many videos, we develop a weakly-supervised multiple object state prediction framework, which only uses action labels during training. Specifically, we propose a novel Pseudo-Label Acquisition (PLA) pipeline that integrates large language models, vision–language models, and action segment annotations to generate fine-grained, per-frame object-state pseudo-labels for training a Multiple Object State Prediction (MOSP) network. We further devise a State–Action Interaction (SAI) module that explicitly models the correlations between actions and object states, thereby improving MOSP. To facilitate comprehensive evaluation, we create the MOSCATO benchmark by augmenting three egocentric video datasets with framewise object-state annotations. Experiments show that our multi-stage pseudo-labeling approach and SAI module significantly boost performance over zero-shot VLM baselines and naive extensions of existing methods, underscoring the importance of holistic action–state modeling for fine-grained procedural video understanding.

Wed 22 Oct. 14:15 - 16:15 PDT

#152
FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Fei Yin · Mallikarjun Reddy · Chun-Han Yao · Rafal Mantiuk · Varun Jampani

We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multi-view data or struggle with geometry accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages geometry, image, and video priors to create full-view, animatable avatars. Our approach first obtains initial coarse geometry through 3D-GAN inversion. Then, it enhances multi-view textures using depth-guided warping signals for cross-view consistency with the help of an image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce Consistent-Inconsistent training to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.

Wed 22 Oct. 14:15 - 16:15 PDT

#153
Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

Yidi Liu · Dong Li · Yuxin Ma · Jie Huang · Wenlong Zhang · Xueyang Fu · Zheng-Jun Zha

Ultra-high-definition (UHD) image restoration often faces computational bottlenecks and information loss due to its extremely high resolution. Existing studies based on Variational Autoencoders (VAE) improve efficiency by transferring the image restoration process from pixel space to latent space. However, because degraded components are inherently coupled with background elements in degraded images, both information loss during compression and information gain during compensation remain uncontrollable. As a result, restored images often exhibit detail loss and incomplete degradation removal. To address this issue, we propose a Controlled Differential Disentangled VAE, which utilizes Hierarchical Contrastive Disentanglement Learning and an Orthogonal Gated Projection Module to guide the VAE to actively discard easily recoverable background information while encoding the more difficult-to-recover degraded information into the latent space. Additionally, we design a Complex Invertible Multiscale Fusion Network to handle background features, ensuring their consistency, and utilize a latent space restoration network to transform the degraded latent features, leading to more accurate restoration results. Extensive experimental results demonstrate that our method effectively alleviates the information loss problem in VAE models while ensuring computational efficiency, significantly improving the quality of UHD image restoration, and achieving state-of-the-art results on six UHD restoration tasks with only 1M parameters.

Wed 22 Oct. 14:15 - 16:15 PDT

#154
MOVE: Motion-Guided Few-Shot Video Object Segmentation

Kaining Ying · Hengrui Hu · Henghui Ding

This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related areas across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion-Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.

Wed 22 Oct. 14:15 - 16:15 PDT

#155
V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

Jianqi Chen · Biao Zhang · Xiangjun Tang · Peter Wonka

We present V2M4, a novel 4D reconstruction method that directly generates a usable 4D mesh animation asset from a single monocular video. Unlike existing approaches that rely on priors from multi-view image and video generation models, our method is based on native 3D mesh generation models. Naively applying 3D mesh generation models to generate a mesh for each frame in a 4D task can lead to issues such as incorrect mesh poses, misalignment of mesh appearance, and inconsistencies in mesh geometry and texture maps. To address these problems, we propose a structured workflow that includes camera search and mesh reposing, condition embedding optimization for mesh appearance refinement, pairwise mesh registration for topology consistency, and global texture map optimization for texture consistency. Our method outputs high-quality 4D animated assets that are compatible with mainstream graphics and game software. Experimental results across a variety of animation types and motion amplitudes demonstrate the generalization and effectiveness of our method. Please refer to our Supplementary Files for video displays.

Wed 22 Oct. 14:15 - 16:15 PDT

#156
Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

Donggeun Lim · Jinseok Bae · Inwoo Hwang · Seungmin Lee · Hwanhee Lee · Young Min Kim

In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We harness the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems, allowing us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability.

Wed 22 Oct. 14:15 - 16:15 PDT

#157
EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Yufei Zhu · Yiming Zhong · Zemin Yang · Peishan Cong · Jingyi Yu · Xinge Zhu · Yuexin Ma

Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments—an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference fine-tuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate the state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulated and real scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#158
Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

wanchang Yu · Qing Zhang · Rongjia Zheng · Wei-Shi Zheng

We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows us to generate a shadow-independent structure map that includes facial details while excluding unwanted shadow boundaries. The structure map is then used as a condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles, and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods and effectively avoids previously common issues such as facial identity tampering, shadow residuals, color distortion, structure blurring, and loss of details. Our code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#159
Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

Kangle Deng · Hsueh-Ti Derek Liu · Yiheng Zhu · Xiaoxia Sun · Chong Shang · Kiran Bhat · Deva Ramanan · Jun-Yan Zhu · Maneesh Agrawala · Tinghui Zhou

Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
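The complexity-adaptive idea can be illustrated with a toy octree builder that keeps subdividing a cell while a per-cell error exceeds a threshold, so simpler shapes end up with fewer leaf cells and hence fewer tokens. The error measure below is a simple variance placeholder standing in for the quadric-error criterion, and all parameters are illustrative.

```python
# Illustrative sketch of complexity-adaptive octree subdivision: a cell is
# split while its geometric error (a placeholder for a quadric-error measure)
# exceeds a tolerance, so detailed regions receive more cells than flat ones.
import numpy as np

def build_octree(points, center, half, depth, max_depth=6, tol=1e-3):
    """Return leaf cells (center, half-size) covering `points`."""
    inside = points[np.all(np.abs(points - center) <= half, axis=1)]
    if len(inside) == 0:
        return []
    # Placeholder error: spread of the points inside the cell.
    error = inside.var(axis=0).sum() if len(inside) > 1 else 0.0
    if depth == max_depth or error < tol:
        return [(center, half)]
    leaves = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child = center + half * np.array([dx, dy, dz])
                leaves += build_octree(points, child, half / 2, depth + 1,
                                       max_depth, tol)
    return leaves

pts = np.random.rand(2000, 3) * 2 - 1           # toy point cloud in [-1, 1]^3
cells = build_octree(pts, np.zeros(3), 1.0, 0)
print(f"{len(cells)} variable-size cells -> one latent token per cell")
```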

Wed 22 Oct. 14:15 - 16:15 PDT

#160
MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Xinyu Liu · Guolei Sun · Cheng Wang · Yixuan Yuan · Ender Konukoglu

High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing state-of-the-art models in reconstruction performance and efficiency.

Wed 22 Oct. 14:15 - 16:15 PDT

#161
EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

Yu-Cheng Lin · Yu-Syuan Xu · Hao-Wei Chen · Hsien-Kai Kuo · Chun-Yi Lee

Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#162
Highlight
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

Shuyu Yang · Yaxiong Wang · Li Zhu · Zhedong Zheng

Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, which locates pedestrians engaged in either routine or anomalous activities via text. To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, and playing soccer, and the corresponding anomalies, e.g., lying, being hit, and falling, of the same identity. The training set of PAB comprises 1,013,605 synthesized image-text pairs of both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates fine-grained behavior retrieval, and the proposed pose-aware method achieves 84.93% Recall@1 accuracy, surpassing other competitive methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#163
SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Wenkun He · Yun Liu · Ruitao Liu · Li Yi

Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. The high correlations and mutual influences among bodies lead to two major challenges, for which we propose solutions. First, to satisfy the high demands for synchronization of different body motions, we mathematically derive a new set of alignment scores during the training process, and use maximum likelihood sampling on a dynamic graphical model for explicit synchronization during inference. Second, the high-frequency interactions between objects are often overshadowed by the large-scale low-frequency movements. To address this, we introduce frequency decomposition and explicitly represent high-frequency components in the frequency domain. Extensive experiments across five datasets with various multi-body configurations demonstrate the superiority of our method, SyncDiff, over existing state-of-the-art motion synthesis methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#164
MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao · Yu Zhong · Ziqi Wang · Liang-Jian Deng

Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which produces a fused clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: a Regression-based and a Flow Matching-based variant. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines.

Wed 22 Oct. 14:15 - 16:15 PDT

#165
Fast Image Super-Resolution via Consistency Rectified Flow

Jiaqi Xu · Wenbo Li · Haoze Sun · Fan Li · Zhixin Wang · Long Peng · Jingjing Ren · HAORAN YANG · Xiaowei Hu · Renjing Pei · Pheng-Ann Heng

Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. This strategy enhances the model's robustness, enabling accurate restoration even when mild perturbations occur in the flow trajectory. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
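To make the LR-to-HR flow idea concrete, here is a hedged sketch of a rectified-flow training step with an additional HR regularizer on the one-step prediction. The network, schedules, and loss weights are stand-ins, not the paper's actual design.

```python
# Hedged sketch of a rectified flow "from LR to HR": points on the straight
# path x_t = (1 - t) * lr + t * hr are fed to a network that regresses the
# constant velocity (hr - lr); an extra regularizer pulls the one-step
# prediction toward the ground-truth HR image. Everything here is illustrative.
import torch
import torch.nn as nn

model = nn.Conv2d(3 + 1, 3, 3, padding=1)        # toy stand-in for the SR net

def flow_losses(lr_up, hr):
    """lr_up: LR image upsampled to HR size; hr: ground-truth HR image."""
    b = lr_up.size(0)
    t = torch.rand(b, 1, 1, 1)
    x_t = (1 - t) * lr_up + t * hr
    t_map = t.expand(-1, 1, *hr.shape[-2:])
    v_pred = model(torch.cat([x_t, t_map], dim=1))
    flow_loss = ((v_pred - (hr - lr_up)) ** 2).mean()     # velocity regression
    hr_reg = ((x_t + (1 - t) * v_pred - hr) ** 2).mean()  # one-step HR target
    return flow_loss, hr_reg

lr_up, hr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
f, r = flow_losses(lr_up, hr)
(f + 0.1 * r).backward()
```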

Wed 22 Oct. 14:15 - 16:15 PDT

#166
Highlight
GENMO: A GENeralist Model for Human MOtion

Jiefeng Li · Jinkun Cao · Haotian Zhang · Davis Rempe · Jan Kautz · Umar Iqbal · Ye Yuan

Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.

Wed 22 Oct. 14:15 - 16:15 PDT

#167
VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching

Xihua Wang · Xin Cheng · Yuyue Wang · Ruihua Song · Yunfeng Wang

Video-to-audio (V2A) generation aims to synthesize temporally aligned, realistic sounds for silent videos, a critical capability for immersive multimedia applications. Current V2A methods, predominantly based on diffusion or flow models, rely on suboptimal noise-to-audio paradigms that entangle cross-modal mappings with stochastic priors, resulting in inefficient training and convoluted transport paths. We propose VAFlow, a novel flow-based framework that directly models the video-to-audio transformation, eliminating reliance on noise priors. To address modality discrepancies, we employ an alignment variational autoencoder (VAE) that compresses heterogeneous video features into audio-aligned latent spaces while preserving spatiotemporal semantics. By retaining cross-attention mechanisms between video features and flow blocks, our architecture enables classifier-free guidance within video source-driven generation. Without external data or complex training tricks, VAFlow achieves state-of-the-art performance on VGGSound benchmark, surpassing even text-augmented models in audio fidelity, diversity, and distribution alignment. This work establishes a new paradigm for V2A generation with a direct and effective video-to-audio transformation via flow matching.
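A minimal sketch of flow matching with a non-Gaussian source, as motivated above: the source sample is a video-aligned latent rather than noise, the target is the audio latent, and the network regresses the straight-path velocity. The toy velocity network and latent dimensions are assumptions.

```python
# Sketch of "noise-free" cross-modal flow matching: source = video-aligned
# latent, target = audio latent, and the network learns the straight-path
# velocity between them. The velocity net below is a toy stand-in.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(256 + 1, 512), nn.SiLU(),
                             nn.Linear(512, 256))

def v2a_flow_loss(video_latent, audio_latent):
    """video_latent, audio_latent: (B, 256), already in a shared latent space."""
    b = video_latent.size(0)
    t = torch.rand(b, 1)
    x_t = (1 - t) * video_latent + t * audio_latent
    v_pred = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((v_pred - (audio_latent - video_latent)) ** 2).mean()

loss = v2a_flow_loss(torch.randn(8, 256), torch.randn(8, 256))
loss.backward()
```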

Wed 22 Oct. 14:15 - 16:15 PDT

#168
Event-guided HDR Reconstruction with Diffusion Priors

Yixin Yang · jiawei zhang · Yang Zhang · Yunxuan Wei · Dongqing Zou · Jimmy Ren · Boxin Shi

Events provide High Dynamic Range (HDR) intensity changes that can guide a Low Dynamic Range (LDR) image toward HDR reconstruction. However, events only provide temporal intensity differences, and the problem remains ill-posed in over-/under-exposed areas due to missing initial reference brightness and color information. With their strong generation ability, diffusion models have shown potential for tackling ill-posed problems. Therefore, we introduce conditional diffusion models to hallucinate the missing information. However, directly adopting events and the LDR image as conditions makes it difficult for diffusion models to sufficiently utilize their information. Thus we introduce a pretrained events-image encoder tailored for HDR reconstruction and a pyramid fusion module to provide HDR conditions, which can be efficiently and effectively utilized by the diffusion model. Moreover, the generation results of diffusion models usually exhibit distortion, particularly in fine-grained details. To better preserve fidelity and suppress distortion, we propose a fine-grained detail recovery approach using a histogram-based structural loss. Experiments on real and synthetic data show the effectiveness of the proposed method in terms of both detail preservation and information hallucination.

Wed 22 Oct. 14:15 - 16:15 PDT

#169
Learning Efficient and Generalizable Human Representation with Human Gaussian Model

Yifan Liu · Shengjun Zhang · Chensheng Dai · Yang Chen · Hao Liu · Chen Li · Yueqi Duan

Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict independent Gaussians for each frame without fully capturing the relations among Gaussians from different frames, which makes them hard to animate with novel poses. To address this, we propose the Human Gaussian Graph (HGG) to generate generalizable and animatable Gaussian representations. Specifically, we construct a dual-layer graph to model the relations between predicted Gaussians from multiple frames and the SMPL mesh. We design an intra-node operation to aggregate Gaussian information at different timesteps to benefit from video inputs. Furthermore, we propose an inter-node operation to support message passing between SMPL vertices. In this manner, we leverage the human structure prior to recover generalizable and animatable Gaussian representations. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.

Wed 22 Oct. 14:15 - 16:15 PDT

#170
SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models

Hongdi Yang · Chengyang Li · Zhenxuan Wu · Gaozheng Li · Jingya Wang · Jingyi Yu · Zhuo Su · Lan Xu

Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character animation with a powerful diffusion-based generative model. Specifically, we first map coarse user control to intricate character trajectories. Then, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. For further physical realism, we integrate a contact guidance module during inference to refine precise ball-foot interactions. Additionally, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.

Wed 22 Oct. 14:15 - 16:15 PDT

#171
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance

Yilin Wei · Mu Lin · Yuhao Lin · Jian-Jian Jiang · Xiao-Ming Wu · Ling-An Zeng · Wei-Shi Zheng

Language-guided dexterous grasp generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intention and to execute grasping for unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose an Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging this gap with a new generalizable-instructive affordance representation. This affordance can generalize to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon the affordance, our framework introduces Affordance Flow Matching (AFM) for affordance generation with language as input, and Grasp Flow Matching (GFM) for generating dexterous grasps with affordance as input. To evaluate our framework, we build an open-set, table-top, language-guided dexterous grasp dataset. Extensive experiments in simulation and the real world show that our framework surpasses all previous methods in both seen-category and unseen-category generalization.

Wed 22 Oct. 14:15 - 16:15 PDT

#172
Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

Jiahua Dong · Hui Yin · Wenqi Liang · Hanbin Zhao · Henghui Ding · Nicu Sebe · Salman Khan · Fahad Khan

Video instance segmentation (VIS) has gained significant attention for its capability in segmenting and tracking object instances across video frames. However, most of the existing VIS methods unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new classes. To address the above challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model, which alleviates catastrophic forgetting of old classes from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt feature, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Experiments verify the effectiveness of our HVPL model compared to other methods.
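A hedged sketch of the orthogonal-gradient idea: the prompt gradient is projected onto the orthogonal complement of a subspace spanned by old-class features, so new-class updates interfere less with previously learned directions. The basis construction here is illustrative.

```python
# Sketch of an orthogonal gradient correction step: remove the component of a
# gradient that lies inside the subspace spanned by old-class features.
import torch

def orthogonal_correction(grad, old_basis):
    """grad: (D,) prompt gradient; old_basis: (D, K) orthonormal columns."""
    return grad - old_basis @ (old_basis.T @ grad)

D, K = 256, 16
old_feats = torch.randn(D, K)                    # stand-in for old-class features
old_basis, _ = torch.linalg.qr(old_feats)        # orthonormalize the old subspace
g = torch.randn(D)
g_corr = orthogonal_correction(g, old_basis)
# The corrected gradient has (numerically) no component in the old subspace.
print(torch.allclose(old_basis.T @ g_corr, torch.zeros(K), atol=1e-5))
```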

Wed 22 Oct. 14:15 - 16:15 PDT

#173
Highlight
LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition

Jinghan You · Shanglin Li · Yuanrui Sun · Jiangchuanwei Wei · Mingyu Guo · Chao Feng · Jiao Ran

Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR), where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT's potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. Specifically, PCO sequentially applies negative class sub-sampling (NCS) for robust and fast feature alignment from random initialization, feature expectation penalties for centroid stabilization, and cluster boundary refinement through full-batch training without NCS constraints. LVFace establishes a new state-of-the-art face recognition baseline, surpassing leading approaches such as UniFace and TopoFR across multiple benchmarks. Extensive experiments demonstrate that LVFace delivers consistent performance gains, while exhibiting scalability to large-scale datasets and compatibility with mainstream VLMs and LLMs. Notably, LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (March 2025), proving its efficacy in real-world scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#174
Empowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need

Yongchuan Cui · Peng Liu · HUI ZHANG

Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging conditions, these models suffer from substantial performance degradation when applied to unseen satellite data, lacking generalizability and thus limiting their applicability. We argue that the performance drops stem primarily from distributional discrepancies between different sources, and that the key to addressing this challenge lies in bridging the gap between training and testing distributions. To validate this idea and further achieve a “train once, deploy forever” capability, this paper introduces a novel and intuitive approach to empower any pansharpening model with generalizability by employing a unified distribution strategy (UniPAN). Specifically, we construct a distribution transformation function that normalizes the pixels sampled from different sources to conform to an identical distribution. The deep models are trained on the transformed domain, and during testing on new datasets, the new data are also transformed to match the training distribution. UniPAN aims to train and test the model on a unified and consistent distribution, thereby enhancing its generalizability. Extensive experiments validate the efficacy of UniPAN, demonstrating its potential to significantly enhance the performance of deep pansharpening models across diverse satellite sensors. Code will be publicly available on GitHub.
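A minimal sketch of the unified-distribution idea under a strong simplifying assumption: each band is standardized to zero mean and unit variance, and the same mapping is applied to unseen sensors at test time. The paper's actual transformation function may differ.

```python
# Sketch of mapping pixels from any sensor to a common target distribution
# before they enter the network (here, simple per-band standardization).
import numpy as np

def to_unified(img, eps=1e-6):
    """img: (H, W, C) multispectral image -> zero-mean, unit-variance per band."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / (std + eps), (mean, std)

def from_unified(out, stats):
    mean, std = stats
    return out * std + mean                      # map a prediction back

train_img = np.random.rand(128, 128, 4) * 2047   # e.g. an 11-bit sensor
test_img = np.random.rand(128, 128, 4) * 255     # a different sensor range
x_train, _ = to_unified(train_img)
x_test, stats = to_unified(test_img)             # same distribution at test time
```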

Wed 22 Oct. 14:15 - 16:15 PDT

#175
MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Yanchen Liu · Yanan SUN · Zhening Xing · Junyao Gao · Kai Chen · Wenjie Pei

Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. Code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#176
Robust Adverse Weather Removal via Spectral-based Spatial Grouping

Yuhwan Jeong · Yunseo Yang · Youngho Yoon · Kuk-Jin Yoon

Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods, such as direct operations in the frequency domain, fail to handle highly variable and localized distortions. To address these issues, we propose the Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping mask that clusters regions based on spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling varied and intricate adverse weather degradations. The code will be available soon.
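The two spectral branches described above can be sketched with off-the-shelf operations: a Sobel filter as the conventional edge detector for the high-frequency branch and a truncated SVD reconstruction for the low-frequency branch. Kernel and rank choices are illustrative.

```python
# Sketch of a two-branch spectral decomposition: Sobel edges (high frequency)
# and a low-rank SVD reconstruction (low frequency) of a grayscale image.
import numpy as np

def sobel_edges(gray):
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    gx = sum(kx[i, j] * pad[i:i + h, j:j + w]
             for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * pad[i:i + h, j:j + w]
             for i in range(3) for j in range(3))
    return np.hypot(gx, gy)

def lowrank_lowfreq(gray, rank=8):
    u, s, vt = np.linalg.svd(gray, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

gray = np.random.rand(64, 64)
high = sobel_edges(gray)          # edge / high-frequency branch
low = lowrank_lowfreq(gray)       # SVD / low-frequency branch
```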

Wed 22 Oct. 14:15 - 16:15 PDT

#177
CarGait: Cross-Attention based Re-ranking for Gait recognition

Gavriel Habib · Noa Barzilay · Or Shimshi · Rami Ben-Ari · Nir Darshan

Gait recognition is a computer vision task that identifies individuals based on their walking patterns. Gait recognition performance is commonly evaluated by ranking a gallery of candidates and measuring the accuracy at the top Rank-K. Existing models are typically single-staged, i.e., searching for the probe's nearest neighbors in a gallery using a single global feature representation. Although these models typically excel at retrieving the correct identity within the top-K predictions, they struggle when hard negatives appear in the top short-list, leading to relatively low performance at the highest ranks (e.g., Rank-1). In this paper, we introduce CarGait, a Cross-Attention Re-ranking method for gait recognition, which re-orders the top-K list by leveraging the fine-grained correlations between pairs of gait sequences through cross-attention between gait strips. This re-ranking scheme can be adapted to existing single-stage models to enhance their final results. We demonstrate the capabilities of CarGait through extensive experiments on three common gait datasets, Gait3D, GREW, and OU-MVLP, and seven different gait models, showing consistent improvements in Rank-1 and Rank-5 accuracy, superior results over existing re-ranking methods, and strong baselines.
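A toy two-stage retrieval sketch in the spirit of the method above: a global-feature search produces a top-K shortlist, then a pairwise cross-attention score between probe and candidate strip features reorders it. The scoring function is a simplified stand-in, not the CarGait module.

```python
# Sketch of retrieval + re-ranking: stage 1 ranks the gallery by global feature
# distance; stage 2 rescores the top-K shortlist with a pairwise cross-attention
# similarity over strip-level features and reorders it.
import torch
import torch.nn.functional as F

def pairwise_score(probe_parts, cand_parts):
    """probe_parts, cand_parts: (P, D) strip-level features."""
    attn = torch.softmax(
        probe_parts @ cand_parts.T / probe_parts.size(-1) ** 0.5, dim=-1)
    attended = attn @ cand_parts                       # (P, D)
    return F.cosine_similarity(probe_parts, attended, dim=-1).mean()

def rerank(probe_global, gallery_global, probe_parts, gallery_parts, k=10):
    dists = torch.cdist(probe_global[None], gallery_global)[0]   # stage 1
    topk = torch.topk(-dists, k).indices
    scores = torch.stack([pairwise_score(probe_parts, gallery_parts[i])
                          for i in topk])
    return topk[torch.argsort(scores, descending=True)]          # stage 2

g, p, d = 100, 8, 128
order = rerank(torch.randn(d), torch.randn(g, d),
               torch.randn(p, d), torch.randn(g, p, d))
```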

Wed 22 Oct. 14:15 - 16:15 PDT

#178
Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables

Wontae Kim · Keuntek Lee · Nam Ik Cho

A 3D lookup table (3D LUT) is a classic yet effective tool for image enhancement and restoration tasks, even in the deep learning era. The 3D LUT efficiently reduces model size and runtime by instantly transforming an input color value into another color value through interpolation of pre-calculated values at the vertices. However, a limitation of 3D LUT transforms is their lack of spatial information, as they convert color values on a point-by-point basis. To address this weakness, researchers have explored spatial-aware 3D LUT methods, which provide spatial features through additional modules. While spatial-aware 3D LUT methods show promising performance, the extra modules introduce a substantial number of parameters and increased runtime, particularly as the resolution of the input image rises. To tackle this issue, we propose a method for generating image-adaptive 3D LUTs by considering the redundant parts of the tables. We introduce an efficient framework that decomposes the 3D LUT into a linear sum of low-dimensional LUTs and utilizes singular value decomposition (SVD). Additionally, we modify the modules for spatial features to be more cache-efficient and image-adaptive, thereby reducing runtime and improving performance. Our model effectively reduces the number of parameters and runtime while maintaining competitive performance, as demonstrated by extensive experimental results.
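To illustrate the redundancy argument, the sketch below compresses a 3D LUT by reshaping it into a matrix, truncating its SVD, and storing it as a linear sum of a few low-dimensional components. A learned LUT is far smoother than the random table used here, so the real savings/accuracy trade-off would be much more favorable; this is not the paper's exact factorization.

```python
# Hedged sketch of SVD-based compression of a (V, V, V, 3) lookup table into
# a few low-dimensional components (a linear sum of smaller tables).
import numpy as np

V, rank = 17, 8                          # 17^3 vertices is a common LUT size
lut = np.random.rand(V, V, V, 3)         # stand-in for a learned 3D LUT

mat = lut.reshape(V, -1)                 # (V, V*V*3)
u, s, vt = np.linalg.svd(mat, full_matrices=False)
basis = u[:, :rank] * s[:rank]           # (V, rank)   low-dimensional LUTs
coeffs = vt[:rank]                       # (rank, V*V*3)

lut_approx = (basis @ coeffs).reshape(V, V, V, 3)
params_full = lut.size
params_lowrank = basis.size + coeffs.size
print(params_lowrank / params_full, np.abs(lut - lut_approx).mean())
```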

Wed 22 Oct. 14:15 - 16:15 PDT

#179
WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Jiwoo Park · Tae Choi · Youngjun Jun · Seong Jae Hwang

Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

Wed 22 Oct. 14:15 - 16:15 PDT

#180
Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings

Haoyu Yao · Bin Yang · Wenke Huang · Mang Ye · Bo Du

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to train a cross-modality retrieval model without labels, reducing the reliance on expensive cross-modality manual annotation. However, existing USL-VI-ReID methods rely on artificially cross-modality-paired data as implicit supervision, which is also expensive to annotate and contrary to the setting of unsupervised tasks. In addition, this full alignment of identities across modalities is inconsistent with real-world scenarios, where unpaired settings are prevalent. To this end, we study the USL-VI-ReID task under unpaired settings, which uses cross-modality unpaired and unlabeled data to train a VI-ReID model. We propose a novel Mapping and Collaborative Learning (MCL) framework. Specifically, we first design a simple yet effective Cross-modality Feature Mapping (CFM) module to map and generate fake cross-modality positive feature pairs, constructing a cross-modal pseudo-identity space for feature alignment. Then, a Static-Dynamic Collaborative (SDC) learning strategy is proposed to align cross-modality correspondences through a collaborative approach, eliminating inter-modality discrepancies at different levels, i.e., cluster-level and instance-level, in scenarios with cross-modal identity mismatches. Extensive experiments on the SYSU-MM01 and RegDB benchmarks under paired and unpaired settings demonstrate that our proposed MCL significantly outperforms existing unsupervised methods, facilitating the real-world deployment of USL-VI-ReID.

Wed 22 Oct. 14:15 - 16:15 PDT

#181
LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning

Jiang Yuan · ji ma · Bo Wang · Guanzhou Ke · Weiming Hu

Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve performance, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on optimizing the discriminability of the IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR achieves outstanding performance with minimal complexity across a range of blind SR tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#182
Multi-identity Human Image Animation with Structural Video Diffusion

Zhenzhi Wang · Yixuan Li · yanhong zeng · Yuwei Guo · Dahua Lin · Tianfan Xue · Bo Dai

Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose conditions and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#183
Embodied Representation Alignment with Mirror Neurons

Wentao Zhu · Zhining Zhang · Yuwei Ren · Yin Huang · Hao Xu · Yizhou Wang

Mirror neurons are a class of neurons that activate both when an individual observes an action and when they perform the same action. This mechanism reveals a fundamental interplay between action understanding and embodied execution, suggesting that these two abilities are inherently connected. Nonetheless, existing machine learning methods largely overlook this interplay, treating these abilities as separate tasks. In this study, we provide a unified perspective in modeling them through the lens of representation learning. We first observe that their intermediate representations spontaneously align. Inspired by mirror neurons, we further introduce an approach that explicitly aligns the representations of observed and executed actions. Specifically, we employ two linear layers to map the representations to a shared latent space, where contrastive learning enforces the alignment of corresponding representations, effectively maximizing their mutual information. Experiments demonstrate that this simple approach fosters mutual synergy between the two tasks, effectively improving representation quality and generalization.
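A minimal sketch of the alignment objective described above, assuming a symmetric InfoNCE formulation: two linear layers map observation and execution features into a shared space, and a contrastive loss pulls corresponding pairs together, maximizing a lower bound on their mutual information. Dimensions are illustrative.

```python
# Sketch of contrastive alignment between observed-action and executed-action
# representations via two linear projection heads and a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_obs, proj_act = nn.Linear(512, 128), nn.Linear(256, 128)

def alignment_loss(obs_repr, act_repr, temperature=0.07):
    z_o = F.normalize(proj_obs(obs_repr), dim=-1)       # (B, 128)
    z_a = F.normalize(proj_act(act_repr), dim=-1)       # (B, 128)
    logits = z_o @ z_a.T / temperature                  # (B, B) pair similarities
    targets = torch.arange(z_o.size(0))                 # matching pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = alignment_loss(torch.randn(32, 512), torch.randn(32, 256))
loss.backward()
```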

Wed 22 Oct. 14:15 - 16:15 PDT

#184
EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

Yixiang Chen · Peiyan Li · Yan Huang · Jiabing Yang · Kehan Chen · Liang Wang

Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module that jointly optimizes movement consistency and goal-image prediction. Moreover, translating EC-Flow into executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify the kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating state-of-the-art performance over prior object-centric flow methods, with improvements of 62% in occluded object handling, 45% in deformable object manipulation, and 80% in non-object-displacement tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#185
Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

Sagnik Majumder · Tushar Nagarajan · Ziad Al-Halah · Kristen Grauman

We introduce Switch-a-view, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled---but human-edited---video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages.

Wed 22 Oct. 14:15 - 16:15 PDT

#186
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu · Yanping Fu · Saike Huang · Yingfei Liu · Fan Jia · Nian Liu · Feng Dai · Tiancai Wang · Rao Anwer · Fahad Khan · Jianbing Shen

General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of reasoning-based, large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of the language instructions is substantially increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network conditioned on the affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our code and benchmark will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#187
DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Maksim Siniukov · Di Chang · Minh Tran · Hongkun Gong · Ashutosh Chaubey · Mohammad Soleymani

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.

Wed 22 Oct. 14:15 - 16:15 PDT

#188
Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image

Shuang Xu · Zixiang Zhao · Haowen Bai · Chang Yu · Jiangjun Peng · Xiangyong Cao · Deyu Meng

Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed unsupervised Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.

Wed 22 Oct. 14:15 - 16:15 PDT

#189
You Think, You ACT: The New Task of Arbitrary Text to Motion Generation

Runqi Wang · Caoyuan Ma · Guopeng Li · Hanrui Xu · Yuke Li · Zheng Wang

Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include action labels (e.g., "walk, bend"), which limits flexibility and practicability in scenarios difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, newly introduced Scene Texts may yield multiple reasonable output results, posing significant challenges for existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HumanML3D++, by extending the texts of the largest existing dataset, HumanML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we also benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction. Our data, model, and code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#190
EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

Quang Nguyen · Nhat Le · Baoru Huang · Minh VU · Chengcheng Tang · Van Nguyen · Ngan Le · Thieu Vo · Anh Nguyen

Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba in modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We also show that our approach is supported by theoretical analysis. Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

Wed 22 Oct. 14:15 - 16:15 PDT

#191
PersonaCraft: Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion

Gwanghyun Kim · Suh Jeon Jeon · Seunggyu Lee · Se Young Chun

Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis.

Wed 22 Oct. 14:15 - 16:15 PDT

#192
DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing

Yang JingYi · Xun Lin · Zitong YU · Liepiao Zhang · Xin Liu · Hui Li · Xiaochen Yuan · Xiaochun Cao

With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi-modal learning, multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) Intra-domain modality misalignment, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non-defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) Inter-domain modality misalignment, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose a fusion module based on mutual information maximization, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed Dual Alignment of Domain and Modality (DADM), achieves state-of-the-art performance in extensive experiments across four challenging protocols demonstrating its robustness in multi-modal domain generalization scenarios. The codes and protocols will be released soon.

Wed 22 Oct. 14:15 - 16:15 PDT

#193
Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

hanwen Zhang · Congqi Cao · Qinyi Lv · Lingtong Min · Yanning Zhang

Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is attracting growing interest, as it can model the normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to anomalies located in local modes near the learned distribution. To handle these "unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. By autoregressively injecting increasingly strong Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data and aggregate the resulting difference with the score function for enhanced appearance perception, while accumulating the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.
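For context, the sketch below shows the standard noise-conditioned denoising score matching objective that this family of methods builds on. The scene/motion conditioning, the transformer architecture, and the autoregressive inference loop from the abstract are not reproduced; the toy MLP and all shapes are assumptions.

```python
# Minimal denoising score matching: perturb clean samples with Gaussian noise and
# train a noise-conditioned model to predict the score of the perturbed distribution.
import torch

def dsm_loss(score_model, x, sigmas):
    """x: (B, D) clean samples; sigmas: (B, 1) noise levels sampled per example."""
    eps = torch.randn_like(x)
    x_noisy = x + sigmas * eps
    # For Gaussian perturbation, the target score is -(x_noisy - x) / sigma^2 = -eps / sigma.
    score = score_model(x_noisy, sigmas)
    # sigma^2-weighted objective, equivalent to || sigma * score + eps ||^2.
    return ((sigmas * score + eps) ** 2).sum(dim=-1).mean()

# Toy score network: an MLP conditioned on the noise level via concatenation.
score_net = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.SiLU(), torch.nn.Linear(64, 16))
model = lambda x, s: score_net(torch.cat([x, s], dim=-1))
loss = dsm_loss(model, torch.randn(8, 16), torch.rand(8, 1) + 0.1)
```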

Wed 22 Oct. 14:15 - 16:15 PDT

#194
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Shaowei Liu · chuan guo · Bing Zhou · Jian Wang

Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.

Wed 22 Oct. 14:15 - 16:15 PDT

#195
Monocular Facial Appearance Capture in the Wild

Yingyan Xu · Kate Gadola · Prashanth Chandran · Sebastian Weiss · Markus Gross · Gaspard Zoss · Derek Bradley

We present a new method for reconstructing the appearance properties of human faces from a lightweight capture procedure in an unconstrained environment. Our method recovers the surface geometry, diffuse albedo, specular intensity and specular roughness from a monocular video containing a simple head rotation in-the-wild. Notably, we make no simplifying assumptions on the environment lighting, and we explicitly take visibility and occlusions into account. As a result, our method can produce facial appearance maps that approach the fidelity of studio-based multi-view captures, but with a far easier and cheaper procedure.

Wed 22 Oct. 14:15 - 16:15 PDT

#196
Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Tobias Kirschstein · Javier Romero · Artem Sevastopolsky · Matthias Nießner · Shunsuke Saito

Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts.

Wed 22 Oct. 14:15 - 16:15 PDT

#197
Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation

Uzay Gökay · Federico Spurio · Dominik Bach · Juergen Gall

Current state-of-the-art methods for skeleton-based temporal action segmentation are fully supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. The latent representation is then segmented into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate our model on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. Our results demonstrate that our method, SMQ, outperforms the current state-of-the-art unsupervised temporal action segmentation methods.
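A minimal sketch of the quantization step described above, assuming per-frame latent vectors and a fixed random codebook: each non-overlapping temporal patch of the embedding is assigned to its nearest codebook entry, yielding one discrete "motion word" per patch. The patch length, codebook size, and feature dimension are illustrative.

```python
# Turn latent patches into discrete skeleton motion words by nearest-codebook assignment.
import torch

def quantize_to_motion_words(latents, codebook, patch_len=8):
    """latents: (T, D) per-frame embeddings; codebook: (K, patch_len * D)."""
    T, D = latents.shape
    T = (T // patch_len) * patch_len                  # drop the incomplete last patch
    patches = latents[:T].reshape(-1, patch_len * D)  # (num_patches, patch_len * D)
    dists = torch.cdist(patches, codebook)            # (num_patches, K)
    return dists.argmin(dim=-1)                       # one motion-word index per patch

codebook = torch.randn(64, 8 * 32)                    # K = 64 motion words (assumed)
words = quantize_to_motion_words(torch.randn(120, 32), codebook)
print(words.shape)                                    # -> torch.Size([15])
```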

Wed 22 Oct. 14:15 - 16:15 PDT

#198
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li · Zhen Xing · Rui Wang · Hui Zhang · Qi Dai · Zuxuan Wu

Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Upon this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics.

Wed 22 Oct. 14:15 - 16:15 PDT

#199
DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation

Donglin Di · He Feng · Wenzhang SUN · Yongjia Ma · Hao Li · Chen Wei · Lei Fan · Tonghua Su · Xun Yang

Human-centric generative models are becoming increasingly popular, giving rise to various innovative tools and applications, such as talking face videos conditioned on text or audio prompts. The core of these capabilities lies in powerful pretrained foundation models, trained on large-scale, high-quality datasets. However, many advanced methods rely on in-house data subject to various constraints, and other current studies fail to generate high-resolution face videos, which is mainly attributed to the significant lack of large-scale, high-quality face video datasets. In this paper, we introduce a human face video dataset, DH-FaceVid-1K. Our collection spans 1200 hours in total, encompassing 270,043 video samples from over 20,000 individuals. Each sample includes corresponding speech audio, facial keypoints, and text annotations. Compared to other publicly available datasets, ours distinguishes itself through its multi-ethnic coverage and high-quality comprehensive individual attributes. We establish multiple face video generation models supporting tasks such as text-to-video and image-to-video generation. In addition, we develop comprehensive benchmarks to validate the scaling law when using different proportions of our dataset. Our primary aim is to contribute a face video dataset, particularly addressing the underrepresentation of Asian faces in existing curated datasets and thereby enriching the global spectrum of face-centric data and mitigating demographic biases.

Wed 22 Oct. 14:15 - 16:15 PDT

#200
Synthetic Video Enhances Physical Fidelity in Video Synthesis

Qi Zhao · Xingyu Ni · Ziyu Wang · Feng Cheng · Ziyan Yang · Lu Jiang · Bohan Wang

We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos generated via standard computer graphics techniques. These rendered videos respect real-world physics, such as maintaining 3D consistency, and thereby serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, minimizing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its effectiveness in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis.

Wed 22 Oct. 14:15 - 16:15 PDT

#201
TimeBooth: Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging

Zepeng Su · zhulin liu · Zongyan Zhang · Tong Zhang · C.L.Philip Chen

Face aging is a typical ill-posed problem influenced by various factors such as environment and genetics, leading to highly diverse outcomes. However, existing methods primarily rely on numerical age representations, making it difficult to accurately capture individual or group-level aging patterns. To address this, we introduce a novel disentangled face representation, where age features are modeled in the image modality, referred to as the Age Prompt, providing richer prior age information to constrain the generation results. To this end, we design an ID-age multi-task co-learning framework and propose the Bidirectional Adversarial Disentanglement (BAD) strategy. This strategy maximizes the disentanglement of ID and age representations through bidirectional adversarial learning, extracting their attribute-invariant representations. Based on this representation, we propose TimeBooth, a personalized face aging model capable of generating diverse and individualized aging results. To optimize training, we construct a cross-age hybrid data pipeline and introduce various training strategies. Finally, we propose the R-AgeMAE metric and validate our method through extensive experiments, demonstrating that TimeBooth outperforms existing methods in both diversity and controllability.

Wed 22 Oct. 14:15 - 16:15 PDT

#202
DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

Hengyuan Zhang · Zhe Li · Xingqun Qi · Mengze Li · Muyi Sun · Siye Wang · Man Zhang · Sirui Han

Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. While existing methods enable dance synthesis directly, they overlook the fact that affording editable dance movements to users is more practical in real choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits progress on this challenge. To achieve this goal, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising over 12.6M dance frames and 42K prompt pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely DanceEditor. Considering that the dance motion should be both musically rhythmic and support iterative editing via user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the fidelity of generated results by directly modeling dance movements from tailored, aligned music. At the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to produce the edited results through a specifically designed Cross-modality Edition Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results remain harmonically aligned with the music while preserving fine-grained semantic alignment with the text descriptions. Extensive experiments demonstrate that our method outperforms state-of-the-art models on our newly collected DanceRemix dataset.

Wed 22 Oct. 14:15 - 16:15 PDT

#203
Identity Preserving 3D Head Stylization with Multiview Score Distillation

Bahri Batuhan Bilecen · Ahmet Berke Gokmen · Furkan Güzelant · Aysegul Dundar

3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across applications such as gaming and virtual reality. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. Leveraging the PanoHead model which provides 360-degree consistent renders, we propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Code will be publicly released.

Wed 22 Oct. 14:15 - 16:15 PDT

#204
IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

Dongjin Kim · Jaekyun Ko · Muhammad Kashif Ali · Tae Hyun Kim

Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning–based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources but still suffer from overfitting. To address these issues, we conduct image denoising utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improve resilience to unseen noise. Repetition of this process greatly improves denoising performance. Our method leverages a Feature Extraction Module for robust noise-invariant features, and Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model (~0.04M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
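A minimal sketch of pixel-wise dynamic filtering in the spirit of the kernel prediction described above: a predictor outputs a small kernel per pixel, which is applied to that pixel's neighborhood, and the step is repeated for iterative refinement. The stand-in convolutional predictor, kernel size, and iteration count are assumptions, not the paper's modules.

```python
# Apply per-pixel predicted kernels to each pixel's k x k neighborhood, iteratively.
import torch
import torch.nn.functional as F

def apply_dynamic_kernels(image, kernels, k=3):
    """image: (B, C, H, W); kernels: (B, k*k, H, W), softmax-normalized per pixel."""
    B, C, H, W = image.shape
    patches = F.unfold(image, kernel_size=k, padding=k // 2)   # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)         # weighted neighborhood sum

kernel_net = torch.nn.Conv2d(3, 9, kernel_size=3, padding=1)   # toy kernel predictor
x = torch.randn(1, 3, 32, 32)                                  # noisy input
for _ in range(3):                                             # iterative filtering
    kernels = torch.softmax(kernel_net(x), dim=1)              # one 3x3 kernel per pixel
    x = apply_dynamic_kernels(x, kernels, k=3)
```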

Wed 22 Oct. 14:15 - 16:15 PDT

#205
GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

Rui Wang · Yimu Sun · Jingxing Guo · Huisi Wu · Jing Qin

Accurate segmentation of cardiac chamber structures in echocardiogram sequences is of great significance for clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. Existing methods based on convolutional neural networks, Transformers, and space-time memory have indeed improved segmentation accuracy to some extent, but they are often restricted by limited local receptive fields and insufficient temporal memory retrieval. In this paper, we propose a novel model for echocardiography video segmentation, called GDKVM. The model employs linear key-value associations (LKVA) to effectively model inter-frame correlations, and introduces the gated delta rule (GDR) to reliably store intermediate memory states. The key-pixel feature fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiogram video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. GDKVM provides more accurate and efficient cardiac chamber segmentation outcomes for clinical applications. The code will be released upon publication.
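As a rough illustration of a gated delta rule for a key-value memory, the sketch below uses one common formulation (gated decay followed by a delta-rule correction). The paper's exact parameterization and its LKVA/KPFF modules are not reproduced here; dimensions, gates, and normalization are assumptions.

```python
# Gated delta-rule update of a key-value memory state S that maps keys to values.
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """S: (d_v, d_k) memory; k: (d_k,) key; v: (d_v,) value; alpha, beta in (0, 1)."""
    k = k / (k.norm() + 1e-6)                    # normalized key
    S = alpha * S                                # gated decay of past associations
    pred = S @ k                                 # value currently stored under key k
    return S + beta * torch.outer(v - pred, k)   # delta-rule correction toward v

d_k, d_v = 16, 32
S = torch.zeros(d_v, d_k)
for t in range(10):                              # scan over a toy frame sequence
    k, v = torch.randn(d_k), torch.randn(d_v)
    S = gated_delta_step(S, k, v, alpha=0.95, beta=0.5)
readout = S @ torch.randn(d_k)                   # query the memory with a new key
```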

Wed 22 Oct. 14:15 - 16:15 PDT

#206
Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Yingjie Zhou · Jiezhang Cao · Zicheng Zhang · Farong Wen · Jiang Yanwei · Jun Jia · Xiaohong Liu · Xiongkuo Min · Guangtao Zhai

Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human medium. However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset to date, THQA-10K, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs, which provides rich material for AGTH quality assessment. Then, volunteers are recruited to subjectively rate the AGTHs and assign the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, the Y-T slice, and tone-lip consistency is proposed. Experimental results show that this method achieves state-of-the-art (SOTA) performance in AGTH quality assessment. The work in this paper will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#207
Towards Efficient General Feature Prediction in Masked Skeleton Modeling

Shengkai Sun · Zefan Zhang · Jianfeng Dong · Zhiyong Cheng · Xiaojun Chang · Meng Wang

Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2× faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#208
Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation

Tanay Agrawal · Abid Ali · Antitza Dantcheva · Francois Bremond

Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy first optimizes for high temporal and low spatial resolution, then vice versa, allowing the model to utilize both high spatial and temporal resolutions during inference while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Extensive experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance, surpassing existing methods in accuracy and efficiency. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.

Wed 22 Oct. 14:15 - 16:15 PDT

#209
How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

Mahnoor Saad · Ziad Al-Halah

How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene's key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods. Code and dataset will be released publicly upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#210
VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

YUE QIU · Yanjun Sun · Takuma Yagi · Shusaku Egami · Natsuki Miyata · Ken Fukuda · Kensho Hara · Ryusuke Sagawa

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetBench, a curated benchmark designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetBench reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#211
MotionCtrl: A Real-time Controllable Vision-Language-Motion Model

Bin Cao · Sipeng Zheng · Ye Wang · Lujie Xia · Qianshan Wei · Qin Jin · Jing Liu · Zongqing Lu

Human motion generation holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators. We believe the release of HuMo100M and MotionCtrl will significantly advance the motion community toward real-life applications. Code and data will be available at https://anonymous.4open.science/r/MotionCtrl.
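Residual quantization, the general mechanism behind the part-aware motion tokenizer described above, can be sketched as follows. The per-body-part split and codebook learning are omitted; codebook sizes, the number of levels, and feature dimensions are assumptions with random placeholder codebooks.

```python
# Residual quantization: each level quantizes what the previous levels left unexplained.
import torch

def residual_quantize(x, codebooks):
    """x: (B, D) motion features; codebooks: list of (K, D) tensors, one per level."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # nearest code at this level
        q = cb[idx]
        codes.append(idx)
        recon = recon + q                                # accumulate the reconstruction
        residual = residual - q                          # pass the remainder to the next level
    return codes, recon

codebooks = [torch.randn(256, 64) for _ in range(3)]     # 3 residual levels (assumed)
codes, recon = residual_quantize(torch.randn(4, 64), codebooks)
```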

Wed 22 Oct. 14:15 - 16:15 PDT

#212
Occlusion-robust Stylization for Drawing-based 3D Animation

Sunjae Yoon · Gwanhyeong Koo · Younghwan Lee · Ji Woo Hong · Chang Yoo

3D animation aims to generate a 3D animated video from an input image and a target 3D motion sequence. Recent advances in image-to-3D models enable the creation of animations directly from user-hand drawings. Distinguished from conventional 3D animation, drawing-based 3D animation is crucial to preserve artist's unique style properties, such as rough contours and distinct stroke patterns. However, recent methods still exhibit quality deterioration in these style properties, especially under occlusions caused by overlapping body parts, leading to contour flickering and stroke blurring. This occurs due to a `stylization pose gap' between training and inference in stylization networks designed to preserve drawing styles in drawing-based 3D animation systems. The stylization pose gap denotes that input target poses used to train the stylization network are always in occlusion-free poses, while target poses encountered in an inference include diverse occlusions under dynamic motions. To this end, we propose Occlusion-robust Stylization Framework (OSF) for drawing-based 3D animation. Our investigation found that while employing object's edge can be effective input prior for guiding stylization, it becomes notably inaccurate when occlusions occur at inference. Thus, our proposed OSF provides occlusion-robust edge guidance for stylization network using optical flow, which ensure a consistent stylization even under occlusions. Furthermore, OSF operates in a single run instead of the previous two-stage method, achieving 2.4$\times$ faster inference and 2.1$\times$ less memory. The code will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#213
Highlight
Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Zheyuan Zhang · Weihao Tang · Hong Chen

Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Moreover, due to its unique design, the model can maintain sensitivity to local information as the feature fusion deepens. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has reached a new state-of-the-art (SOTA) level in standard MER tasks with the provided ground truth key-frames.

Wed 22 Oct. 14:15 - 16:15 PDT

#214
Highlight
Video Individual Counting for Moving Drones

Yaowu Fan · Jia Wan · Tao Han · Antoni Chan · Jinhua Ma

Video Individual Counting (VIC) has received increasing attention recently due to its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous crowd counting datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation for highly varying views and times in crowded scenes. While VIC methods have been proposed based on localization-then-association or localization-then-classification, they may not perform well due to the difficulty of accurately localizing crowded and small targets under challenging scenarios. To address these issues, we collect a MovingDroneCrowd dataset and propose a density map based VIC method. Different from existing datasets, our dataset consists of videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. Instead of localizing individuals, we propose a Depth-wise Cross-Frame Attention (DCFA) module, which directly estimates inflow and outflow density maps to learn the shared density between consecutive frames. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our dataset and publicly available ones demonstrate the superiority of our method over the state of the art for VIC in highly dynamic and complex crowded scenes. Our dataset and code will be released publicly.
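The counting rule described above reduces to integrating density maps, as in this minimal sketch; random arrays stand in for the predicted first-frame density and the per-frame inflow densities.

```python
# Video-level count of unique pedestrians = integral of the first-frame density
# plus the integrals of the inflow (newly appearing) densities of later frames.
import numpy as np

def video_individual_count(first_frame_density, inflow_densities):
    """first_frame_density: (H, W); inflow_densities: list of (H, W) maps for frames 2..T."""
    total = first_frame_density.sum()
    for inflow in inflow_densities:
        total += inflow.sum()        # each inflow map integrates to the number of new people
    return total

count = video_individual_count(np.random.rand(64, 64) * 1e-3,
                               [np.random.rand(64, 64) * 1e-4 for _ in range(9)])
```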

Wed 22 Oct. 14:15 - 16:15 PDT

#215
What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung · Frangil Ramirez · Juhyung Ha · Yi-Hsuan Tsai · Yi-Ting Chen · David Crandall

Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions, but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, and long-term action recognition. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks. We will make our source code and data publicly available upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#216
Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Shihao Zhou · Dayu Li · Jinshan Pan · Juncheng Zhou · Jinglei Shi · Jufeng Yang

Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculations independently on uniformly split subspaces, which triggers a redundancy issue that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.
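The "diverse learners" idea of heads attending over subspaces of different sizes, rather than a uniform split, can be sketched as below. The head widths are illustrative, the query/key/value projections are replaced by identity stand-ins, and the QKCU cache updating is not shown.

```python
# Single attention layer where each head reads from a slice of a different width.
import torch
import torch.nn.functional as F

def varied_head_attention(x, head_dims=(8, 16, 40)):
    """x: (B, N, D) with D == sum(head_dims); each head attends within its own slice."""
    outputs, start = [], 0
    for d in head_dims:
        chunk = x[..., start:start + d]                      # this head's subspace
        q, k, v = chunk, chunk, chunk                        # stand-in projections
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        outputs.append(attn @ v)
        start += d
    return torch.cat(outputs, dim=-1)                        # (B, N, D)

out = varied_head_attention(torch.randn(2, 100, 64))
```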

Wed 22 Oct. 14:15 - 16:15 PDT

#217
HADES: Human Avatar with Dynamic Explicit Hair Strands

Zhanfeng Liao · Hanzhang Tu · Cheng Peng · Hongwen Zhang · Boyao Zhou · Yebin Liu

We introduce HADES, the first framework to seamlessly integrate dynamic hair into human avatars. HADES represents hair as strands bound to 3D Gaussians, with roots attached to the scalp. By modeling inertial and velocity-aware motion, HADES is able to simulate realistic hair dynamics that naturally align with body movements. To enhance avatar fidelity, we incorporate multi-scale data and address color inconsistencies across cameras using a lightweight MLP-based correction module, which generates color correction matrices for consistent color tones. Besides, we resolve rendering artifacts, such as hair dilation during zoom-out, through a 2D Mip filter and physically constrained hair radii. Furthermore, a temporal fusion module is introduced to ensure temporal coherence by modeling historical motion states. Experimental results demonstrate that HADES achieves high-fidelity avatars with physically plausible hair dynamics, outperforming existing state-of-the-art solutions in realism and robustness.

Wed 22 Oct. 14:15 - 16:15 PDT

#218
FlowDPS : Flow-Driven Posterior Sampling for Inverse Problems

Jeongsol Kim · Bryan Sangwoo Kim · Jong Ye

Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend diffusion inverse solvers (DIS), which perform posterior sampling by combining a denoising diffusion prior with a likelihood gradient, into the flow framework. Specifically, by deriving the flow version of Tweedie's formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.
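A hedged sketch of the decomposition mentioned above, written under one common flow-matching convention (noise x_0 ~ N(0, I), data x_1, linear interpolant), which may differ from the paper's exact notation: the learned velocity yields both a clean-image estimate and a noise estimate from x_t, to which the likelihood gradient and stochastic noise can then be applied respectively.

```latex
% One common flow-matching convention (an assumption, not necessarily the paper's):
\[
x_t = t\,x_1 + (1-t)\,x_0, \qquad v_\theta(x_t, t) \approx x_1 - x_0 ,
\]
% so the interpolant can be split into a clean-image estimate and a noise estimate:
\[
\hat{x}_1 = x_t + (1-t)\,v_\theta(x_t, t), \qquad
\hat{x}_0 = x_t - t\,v_\theta(x_t, t).
\]
```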

Wed 22 Oct. 14:15 - 16:15 PDT

#219
ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior

Alireza Esmaeilzehi · Hossein Zaredar · Yapeng Tian · Laleh Seyyed-Kalantari

Deep blind image super resolution (Blind SR) schemes strive to provide high performance under various image degradation processes. Despite the significant advancement in the area of Blind SR, the performance of these methods may still fall short of what one would desire for real-world degradation operations. In this paper, we develop a novel diffusion-based Blind SR method, which, by leveraging compositional zero-shot learning, is able to provide superior performance for both synthetic and real-world unknown degradation processes. Specifically, we first extract both synthetic and real-world degradation embeddings from the input visual signal in a compositional zero-shot fashion. Next, we efficiently embed these degradation embeddings in the architecture of our diffusion-based scheme to guide the diffusion feature generation process. The results of extensive experiments demonstrate the effectiveness of the proposed Blind SR method over state-of-the-art algorithms. Our source code and pre-trained models will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#220
Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks

Hao Huang · Shuaihang Yuan · Geeta Chandra Raju Bethala · Congcong Wen · Anthony Tzes · Yi Fang

Policy learning focuses on devising strategies for agents in embodied AI systems to perform optimal actions based on their perceived states. One of the key challenges in policy learning involves handling complex, long-horizon tasks that require managing extensive sequences of actions and observations. Wavelet analysis offers significant advantages in signal processing, notably in decomposing signals at multiple scales to capture both global trends and fine-grained details. In this work, we introduce a novel wavelet policy learning framework that utilizes wavelet transformations to enhance policy learning. Our approach leverages multi-scale wavelet decomposition to facilitate detailed observation analysis and robust action planning over extended sequences. We detail the design and implementation of our wavelet policy, which incorporates lifting schemes for effective multi-resolution analysis and action generation. This framework is evaluated across multiple complex scenarios, including robotic manipulation and self-driving, demonstrating our method's effectiveness in improving the learned policy's precision and reliability.
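The lifting scheme at the heart of such a wavelet decomposition can be illustrated with the Haar wavelet on a 1D sequence; this is a generic sketch (even-length signals assumed), not the learned policy network itself.

```python
# Haar lifting: split into even/odd samples, predict the odd ones from the even ones,
# then update the even ones so the approximation keeps the running average.
import numpy as np

def haar_lift(signal):
    even, odd = signal[0::2], signal[1::2]
    detail = odd - even            # predict step: fine-grained residual
    approx = even + detail / 2     # update step: coarse trend
    return approx, detail

def haar_unlift(approx, detail):
    even = approx - detail / 2
    odd = even + detail
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd
    return out

x = np.random.randn(16)
a, d = haar_lift(x)                          # global trend + local detail
a2, d2 = haar_lift(a)                        # recurse for a multi-scale decomposition
assert np.allclose(haar_unlift(a, d), x)     # lifting is exactly invertible
```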

Wed 22 Oct. 14:15 - 16:15 PDT

#221
VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

Xindi Yang · Baolu Li · Yiming Zhang · Zhenfei Yin · LEI BAI · Liqian Ma · Zhiyong Wang · Jianfei Cai · Tien-Tsin Wong · Huchuan Lu · Xu Jia

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#222
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira · Chenfeng Xu · Toshiki Hazama · Takanori Yoshimoto · Kohei Ohno · Shogo Mitsuhori · Soichi Sugano · Hanying Cho · Zhijian Liu · Masayoshi Tomizuka · Kurt Keutzer

We introduce StreamDiffusion, a real-time diffusion pipeline designed for streaming image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as augmented/virtual reality, video game graphics rendering, live video streaming, and broadcasting, where high throughput is imperative. StreamDiffusion tackles this challenge through a novel pipeline-level system design. It employs unique strategies like batching the denoising process (Stream Batch), residual classifier-free guidance (R-CFG), and stochastic similarity filtering (SSF). Additionally, it seamlessly integrates advanced acceleration technologies for maximum efficiency. Specifically, Stream Batch reformulates the denoising process by eliminating the traditional wait-and-execute approach and utilizing a batching denoising approach, facilitating fluid and high-throughput streams. This results in 1.5x higher throughput compared to the conventional sequential denoising approach. R-CFG significantly addresses inefficiencies caused by repetitive computations during denoising. It optimizes the process to require minimal or no additional computations, leading to speed improvements of up to 2.05x compared to previous classifier-free methods. Besides, our stochastic similarity filtering dramatically lowers GPU activation frequency by halting computations for static image flows, achieving a remarkable reduction in computational consumption: 2.39 times on an RTX 3060 GPU and 1.99 times on an RTX 4090 GPU, respectively. The synergy of our proposed strategies with established acceleration technologies enables image generation to reach speeds of up to 91.07 fps on a single RTX 4090 GPU, significantly outperforming the throughput of AutoPipeline, developed by Diffusers, by more than 59.56 times.
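A toy sketch of the Stream Batch idea: incoming frames enter a queue of latents held at staggered denoising steps, and each arrival triggers one batched denoiser call for the whole queue, so one frame completes per iteration at steady state. The denoiser, step count, and latent shapes are placeholders, and R-CFG and similarity filtering are not reproduced.

```python
# Staggered, batched denoising of a stream of frames.
import torch

T = 4                                          # number of denoising steps (assumed)
denoise = lambda z, steps: z * 0.9             # stand-in for one batched denoiser call

queue = []                                     # list of [latent, current_step]
for frame_idx in range(10):
    queue.append([torch.randn(4, 8, 8), 0])    # new frame enters at step 0
    batch = torch.stack([z for z, _ in queue])
    steps = [t for _, t in queue]              # per-latent timesteps (unused by the stand-in)
    batch = denoise(batch, steps)              # one forward pass advances every latent
    queue = [[batch[i], t + 1] for i, (_, t) in enumerate(queue)]
    if queue[0][1] == T:                       # oldest latent is fully denoised
        finished, _ = queue.pop(0)             # emit it and keep streaming
```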

Wed 22 Oct. 14:15 - 16:15 PDT

#223
HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

Byungjun Kim · Shunsuke Saito · Giljoo Nam · Tomas Simon · Jason Saragih · Hanbyul Joo · Junxuan Li

We present a universal prior model for 3D head avatars with hair compositionality. Existing approaches for building a generalizable prior for 3D head avatars often model face and hair in a monolithic manner, where the inherent compositionality of the human head and hair is not considered. It is especially challenging for a monolithic model to self-discover the compositionality of face and hair when the dataset is not large enough. Moreover, extending a monolithic model to applications like swapping faces or hairstyles in 3D is not straightforward. Our prior model explicitly accounts for the compositionality of face and hair, learning their priors separately. To learn disentangled latent spaces of face and hair for 3D head avatars, we propose a synthetic hairless data creation pipeline that dehairs the studio-captured dataset with estimated hairless geometry and hairless texture obtained from a diffusion prior. Using a paired dataset of hair and hairless captures, disentangled prior models for face and hair can be trained by leveraging compositionality as an inductive bias to achieve disentanglement. Our model's inherent compositionality enables a seamless transfer of face and hair components between avatars while maintaining the subject's identity. Furthermore, we demonstrate that our model can be finetuned with a monocular capture to create hair-compositional 3D head avatars for unseen subjects, highlighting the practical applicability of our prior model in real-world scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#224
Learning Streaming Video Representation via Multitask Training

Yibin Yan · Jilan Xu · Shangzhe Di · Yikun Liu · Yudi Shi · Qirui Chen · Zeqian Li · Yifei Huang · Weidi Xie

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
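The causal temporal attention mentioned above can be pictured with a minimal sketch: a lower-triangular mask restricts each frame token to attend only to itself and earlier frames. The shapes and the use of PyTorch's scaled_dot_product_attention are illustrative assumptions, not the paper's implementation.

```python
import torch

def causal_temporal_mask(num_frames: int) -> torch.Tensor:
    # Boolean mask: entry [i, j] is True when frame i may attend to frame j,
    # i.e. only the current and earlier frames are visible.
    return torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))

# Example: plug the mask into attention so a pre-trained ViT can process a
# stream frame by frame without peeking ahead.
q = k = v = torch.randn(1, 8, 16, 64)            # (batch, heads, frames, dim)
mask = causal_temporal_mask(16)                  # (frames, frames)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```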

Wed 22 Oct. 14:15 - 16:15 PDT

#225
DreamRelation: Relation-Centric Video Customization

Yujie Wei · Shiwei Zhang · Hangjie Yuan · Biao Gong · Longxiang Tang · Xiang Wang · Haonan Qiu · Hengjia Li · Shuai Tan · Yingya Zhang · Hongming Shan

Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose $\textbf{DreamRelation}$, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
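For intuition only, a relation LoRA in the sense used above can be sketched as a frozen projection plus a trainable low-rank residual attached to the query, key, and value projections; the rank, scaling, and wiring here are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank residual update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep pretrained weights fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as an identity-preserving update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# A "triplet" in this sketch is simply one adapter per attention projection.
q_proj, k_proj, v_proj = (LoRALinear(nn.Linear(768, 768)) for _ in range(3))
```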

Wed 22 Oct. 14:15 - 16:15 PDT

#226
ModSkill: Physical Character Skill Modularization

Yiming Huang · Zhiyang Dou · Lingjie Liu

Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Prior methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts, leveraging body structure-inspired inductive bias to enhance skill learning performance. Our framework features a skill modularization attention mechanism that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We further propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks. Our code will be released publicly upon publication.

Wed 22 Oct. 14:15 - 16:15 PDT

#227
Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou · Hang Gao · Vikram Voleti · Aaryaman Vasishta · Chun-Han Yao · Mark Boss · Philip Torr · Christian Rupprecht · Varun Jampani

We present $\underline{\text{S}}$tabl$\underline{\text{e}}$ $\underline{\text{V}}$irtual C$\underline{\text{a}}$mera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.

Wed 22 Oct. 14:15 - 16:15 PDT

#228
Highlight
Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Qiaosi Yi · Shuai Li · Rongyuan Wu · Lingchen Sun · Yuhui WU · Lei Zhang

Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one well-known yet critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet to preserve the diffusion prior while mitigating the increased computational cost poses new challenges. To address these issues, we propose a transfer VAE training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while preserving the pre-trained diffusion prior. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy helps align the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the overall computational cost while effectively capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. Codes and models will be released.
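The two-stage TVT recipe described above maps onto a simple training loop, sketched below under the assumption of a plain L1 reconstruction objective and hypothetical module names (enc8 for the frozen original 8x encoder, dec4 and enc4 for the new 4x decoder and encoder); the actual losses and schedules are those described in the paper.

```python
import torch
import torch.nn.functional as F

def train_tvt(enc8, dec4, enc4, loader, steps_per_stage=10_000):
    # Stage 1: fit the new 4x decoder to reconstruct images from the
    # frozen original 8x encoder's latents.
    opt = torch.optim.AdamW(dec4.parameters(), lr=1e-4)
    for _, img in zip(range(steps_per_stage), loader):
        with torch.no_grad():
            z = enc8(img)                         # original latent space
        loss = F.l1_loss(dec4(z), img)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: freeze the trained 4x decoder and fit the 4x encoder to it,
    # keeping the new encoder-decoder pair aligned with the original latents.
    for p in dec4.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(enc4.parameters(), lr=1e-4)
    for _, img in zip(range(steps_per_stage), loader):
        loss = F.l1_loss(dec4(enc4(img)), img)
        opt.zero_grad()
        loss.backward()
        opt.step()
```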

Wed 22 Oct. 14:15 - 16:15 PDT

#229
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

Jian-Jian Jiang · Xiao-Ming Wu · Yi-Xiang He · Ling-An Zeng · Yilin Wei · Dandan Zhang · Wei-Shi Zheng

Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to consider because cooperation is enforced at the input stage. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.

Wed 22 Oct. 14:15 - 16:15 PDT

#230
LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association

Peng Wang · Yongcai Wang · Hualong Cao · Wang Chen · Deying Li

This paper proposes LA-MOTR, a novel Tracking-by-Learnable-Association framework that resolves the competing optimization objectives between detection and association in end-to-end Tracking-by-Attention (TbA) Multi-Object Tracking. Current TbA methods employ shared decoders for simultaneous object detection and tracklet association, which often results in task interference and suboptimal accuracy. By contrast, our end-to-end framework decouples these tasks into two specialized modules: Separated Object-Tracklet Detection (SOTD) and Spatial-Guided Learnable Association (SGLA). This decoupled design offers flexibility and explainability. In particular, SOTD independently detects new objects and existing tracklets in each frame, while SGLA associates them via a Spatial-Weighted Learnable Attention module guided by relative spatial cues. Temporal coherence is further maintained through a Tracklet Updates Module. The learnable association mechanism resolves the inherent suboptimal association issues in decoupled frameworks, avoiding the task interference commonly observed in joint approaches. Evaluations on the DanceTrack, MOT17, and SportsMOT datasets demonstrate state-of-the-art performance. Extensive ablation studies validate the effectiveness of the designed modules. Code will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#231
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation

Xincheng Shuai · Henghui Ding · Zhenyuan Qin · Hao Luo · Xingjun Ma · Dacheng Tao

Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models in learning to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and the camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#232
Learning A Unified Template for Gait Recognition

Panjian Huang · Saihui Hou · Junzhou Huang · Yongzhen Huang

"What I cannot create, I do not understand." Human wisdom reveals that creation is one of the highest forms of learning. For example, Diffusion Models have demonstrated remarkable semantic structural and memory capabilities in image generation, denoising, and restoration, which intuitively benefits representation learning. However, current gait networks rarely embrace this perspective, relying primarily on learning by contrasting gait samples under varying complex conditions, leading to semantic inconsistency and uniformity issues. To address these issues, we propose Origins, a model with generative capabilities whose underlying philosophy is that different entities are generated from a unified template, inherently regularizing gait representations within a consistent and diverse semantic space to capture differences accurately. Admittedly, learning this unified template is exceedingly challenging, as the template must be comprehensive enough to encompass gait representations under various conditions. Inspired by Diffusion Models, Origins diffuses the unified template into timestep templates for gait generative modeling, while transferring the unified template for gait representation learning. In particular, gait generative modeling and representation learning are combined in a unified framework for end-to-end joint training. Extensive experiments on CASIA-B, CCPG, SUSTech1K, Gait3D, GREW and CCGR-MINI demonstrate that Origins performs representation learning within a unified template, achieving superior performance.

Wed 22 Oct. 14:15 - 16:15 PDT

#233
Highlight
Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation

Zheyun Qin · Deng Yu · Chuanchen Luo · Zhumin Chen

In recent years, researchers have explored the task of open-vocabulary video instance segmentation, which aims to identify, track, and segment any instance within an open set of categories. The core challenge of Open-Vocabulary VIS lies in solving the cross-domain alignment problem, including spatial-temporal and text-visual domain alignments. Existing methods have made progress but still face shortcomings in addressing these alignments, especially due to data heterogeneity. Inspired by metric learning, we propose an innovative Sliced Wasserstein Bridging Learning Framework. This framework utilizes the Sliced Wasserstein distance as the core tool for metric learning, effectively bridging the four domains involved in the task. Our innovations are threefold: (1) Domain Alignment: By mapping features from different domains into a unified metric space, our method maintains temporal consistency and learns intrinsic consistent features between modalities, improving the fusion of text and visual information. (2) Weighting Mechanism: We introduce an importance weighting mechanism to enhance the discriminative ability of our method when dealing with imbalanced or significantly different data. (3) High Efficiency: Our method inherits the computational efficiency of the Sliced Wasserstein distance, allowing for online processing of large-scale video data while maintaining segmentation accuracy. Through extensive experimental evaluations, we have validated the robustness of our concept and the effectiveness of our framework.
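For reference, the sliced Wasserstein distance at the core of the framework can be estimated by projecting both point sets onto random directions and sorting, as in the generic (unweighted) sketch below; the paper's importance-weighting mechanism is not reproduced here.

```python
import torch

def sliced_wasserstein(x: torch.Tensor, y: torch.Tensor,
                       num_projections: int = 128) -> torch.Tensor:
    """Monte-Carlo sliced Wasserstein-2 distance between two point sets of
    shape (n, d), assuming equal sizes. Each random direction reduces the
    problem to 1D optimal transport, which sorting solves exactly."""
    d = x.shape[1]
    theta = torch.randn(num_projections, d, device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)   # unit projection directions
    x_proj = x @ theta.T                              # (n, num_projections)
    y_proj = y @ theta.T
    x_sorted, _ = torch.sort(x_proj, dim=0)
    y_sorted, _ = torch.sort(y_proj, dim=0)
    return ((x_sorted - y_sorted) ** 2).mean().sqrt()
```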

Wed 22 Oct. 14:15 - 16:15 PDT

#234
Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

Yunshan Zhong · Yuyao Zhou · Yuxin Zhang · Wanchen Sui · Shen Li · Yong Li · Fei Chao · Rongrong Ji

Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B.

Wed 22 Oct. 14:15 - 16:15 PDT

#236
Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang · Sicheng Xu · Chuxin Wang · Jiaolong Yang · Feng Zhao · Dong Chen · Baining Guo

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content.

Wed 22 Oct. 14:15 - 16:15 PDT

#237
Synchronization of Multiple Videos

Avihai Naaman · Ron Shapira Weber · Oren Freifeld

Synchronizing multiple videos depicting the same action is straightforward when recorded from a single scene with multiple cameras, often reducible to a simple time-axis shift. However, in-the-wild scenarios and, more recently, multiple generative AI–produced videos pose a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignments. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any off-the-shelf model. TPL robustly aligns videos—whether real-world or generative—by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL offers improved synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Crucially, TPL is the first approach to mitigate out-of-sync issues for multiple generative AI videos of the same action. We will release our code upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#238
DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Yinqi Cai · Jichang Li · Zhaolun Li · Weikai Chen · Rushi Lan · Xi Xie · Xiaonan Luo · Guanbin Li

Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

Wed 22 Oct. 14:15 - 16:15 PDT

#239
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Liang Xu · Chengqun Yang · Zili Lin · Fei Xu · Yifan Liu · Congsheng Xu · Yiyi Zhang · Jie Qin · Xingdong Sheng · Yunhui Liu · Xin Jin · Yichao Yan · Wenjun Zeng · Xiaokang Yang

Learning action models from real-world human-centric interaction datasets is important for building general-purpose intelligent assistants with efficiency. However, most existing datasets only cover specialist interaction categories and ignore the fact that AI assistants perceive and act based on first-person acquisition. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions, and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.

Wed 22 Oct. 14:15 - 16:15 PDT

#240
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fating Hong · Zunnan Xu · Zixiang Zhou · Jun Zhou · Xiu Li · Qin Lin · Qinglin Lu · Dan Xu

Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.
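A gate over parallel control branches, as described above, can be sketched as follows; plain linear layers stand in for the mamba branches, and the token/signal shapes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class GatedParallelBranches(nn.Module):
    """Combine parallel control branches with a per-token gate: each branch
    consumes one driving signal (e.g. audio or expression), and the gate
    decides how much each branch contributes to the fused feature."""
    def __init__(self, dim: int, num_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_branches))
        self.gate = nn.Linear(dim, num_branches)

    def forward(self, tokens: torch.Tensor, signals: list[torch.Tensor]) -> torch.Tensor:
        # tokens: (B, N, D); signals: one (B, N, D) control feature per branch
        outs = torch.stack([b(tokens + s) for b, s in zip(self.branches, signals)], dim=-1)
        weights = torch.softmax(self.gate(tokens), dim=-1).unsqueeze(-2)  # (B, N, 1, K)
        return (outs * weights).sum(dim=-1)                               # (B, N, D)
```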

Wed 22 Oct. 14:15 - 16:15 PDT

#241
Highlight
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation

Shiqi Huang · Shuting He · Huaiyuan Qin · Bihan Wen

Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large-scale, real-world geospatial analysis.

Wed 22 Oct. 14:15 - 16:15 PDT

#242
VertexRegen: Mesh Generation with Continuous Level of Detail

Xiang Zhang · Yawar Siddiqui · Armen Avetisyan · Christopher Xie · Jakob Engel · Henry Howard-Jenkins

We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner, and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e., vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.

Wed 22 Oct. 14:15 - 16:15 PDT

#243
Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma · Yiran He · Bin Sun · Shutao Li

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
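The hard-to-soft prompt alignment step can be illustrated with a simple cosine-distance objective between the learnable soft-prompt text features and the frozen encodings of the LLM-generated descriptions; this is a generic sketch under assumed shapes, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_loss(soft_prompt_feats: torch.Tensor,
                          hard_prompt_feats: torch.Tensor) -> torch.Tensor:
    # Pull each class's soft-prompt text feature toward the (frozen) encoding
    # of the LLM-written description for that class.
    # Both tensors have shape (num_classes, dim).
    soft = F.normalize(soft_prompt_feats, dim=-1)
    hard = F.normalize(hard_prompt_feats, dim=-1)
    return (1.0 - (soft * hard).sum(dim=-1)).mean()   # mean cosine distance
```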

Wed 22 Oct. 14:15 - 16:15 PDT

#244
Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

Jiaxin Huang · Sheng Miao · Bangbang Yang · Yuewen Ma · Yiyi Liao

Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views — synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

Wed 22 Oct. 14:15 - 16:15 PDT

#245
AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation

Hao Li · Ju Dai · Feng Zhou · Kaida Ning · Lei Li · Junjun Pan

While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs.

Wed 22 Oct. 14:15 - 16:15 PDT

#246
GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Quanwei Yang · Luying Huang · Kaisiyuan Wang · Jiazhi Guan · Shengyi He · Fengguo Li · Hang Zhou · Lingyun Yu · Yingying Li · Haocheng Feng · Hongtao Xie

While increasing attention has been paid to human gesture synthesis, most previous works concentrate on holistic body movements without investigating hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system built upon a hybrid-modality diffusion transformer architecture with newly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee that these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all the counterparts.

Wed 22 Oct. 14:15 - 16:15 PDT

#247
FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Hao Li · Xiang Chen · Jiangxin Dong · Jinhui Tang · Jinshan Pan

Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundation models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples and higher-diversity data types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

Wed 22 Oct. 14:15 - 16:15 PDT

#248
Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion

Zeyu Wang · Jizheng Zhang · Haiyu Song · Mingyu Ge · Jiayu Wang · Haoran Duan

Infrared and visible image fusion (VIS-IR) aims to integrate complementary information from both source images to produce a fused image with enriched details. However, most existing fusion models lack controllability, making it difficult to customize the fused output according to user preferences. To address this challenge, we propose a novel weakly-supervised, instance-level controllable fusion model that adaptively highlights user-specified instances based on input text. Our model consists of two stages: pseudo-label generation and fusion network training. In the first stage, guided by observed multimodal manifold priors, we leverage text and manifold similarity as joint supervisory signals to train a text-to-image response network (TIRN) in a weakly-supervised manner, enabling it to identify referenced semantic-level objects from instance segmentation outputs. To align text and image features in TIRN, we propose a multimodal feature alignment module (MFA), using manifold similarity to guide attention weight assignment for precise correspondence between image patches and text embeddings. Moreover, we employ spatial positional relationships to accurately select the referenced instances from multiple semantic-level objects. In the second stage, the fusion network takes source images and text as input, using the generated pseudo-labels for supervision to apply distinct fusion strategies for target and non-target regions. Experimental results show that our model not only generates precise pseudo-labels but also achieves state-of-the-art fusion performance while highlighting user-defined instances. Our code will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#249
Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

Youwei Zhou · Tianyang Xu · Cong Wu · Xiaojun Wu · Josef Kittler

The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, most of the existing GCNs rely on the binary connection of two neighboring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. Although some studies have attempted to utilize hyper-graphs to represent the topology, they rely on a fixed construction strategy, which limits their adaptivity in uncovering the intricate latent relationships within the action. In this paper, we address this oversight and explore the merits of an adaptive hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. Specifically, we outperform the existing solutions on NTU-120, achieving 90.5% and 91.7% in terms of the top-1 recognition accuracy on X-Sub and X-Set.
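For orientation, a single adaptive hypergraph convolution can be sketched with the textbook formulation below, where the joint-by-hyperedge incidence matrix is itself a learnable parameter; the paper's exact normalization and virtual-connection injection are not reproduced.

```python
import torch
import torch.nn as nn

class HyperGraphConv(nn.Module):
    """One hypergraph convolution layer: X' = Dv^-1 H W De^-1 H^T X Theta,
    with a learnable (hence "adaptive") incidence matrix H."""
    def __init__(self, num_joints: int, num_edges: int, in_dim: int, out_dim: int):
        super().__init__()
        self.H = nn.Parameter(torch.rand(num_joints, num_edges))   # incidence logits
        self.W = nn.Parameter(torch.ones(num_edges))                # hyperedge weights
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (B, J, C)
        H = torch.sigmoid(self.H)                                   # soft memberships
        Dv = H @ self.W                                             # (J,) vertex degrees
        De = H.sum(dim=0)                                           # (E,) edge degrees
        agg = (H * self.W) / De.clamp(min=1e-6)                     # H W De^-1, (J, E)
        prop = agg @ H.T / Dv.clamp(min=1e-6).unsqueeze(1)          # Dv^-1 H W De^-1 H^T
        return prop @ self.theta(x)                                 # broadcast over batch
```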

Wed 22 Oct. 14:15 - 16:15 PDT

#250
Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

Yafei Zhang · Lingqi Kong · Huafeng Li · Jie Wen

To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model’s ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.

Wed 22 Oct. 14:15 - 16:15 PDT

#251
PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Geonhee Sim · Gyeongsik Moon

Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses. Code and weights will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#252
Highlight
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan · Ming Hu · Yiwen Jiang · Xieji Li · Hao Fei · Philipp Tschandl · Harald Kittler · Zongyuan Ge

The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M’s potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.

Wed 22 Oct. 14:15 - 16:15 PDT

#253
FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

Weijie Lyu · Yi Zhou · Ming-Hsuan Yang · Zhixin Shu

We present $\textit{FaceLift}$, a novel feed-forward approach for generalizable high-quality 360-degree 3D head reconstruction from a single image. Our pipeline first employs a multi-view latent diffusion model to generate consistent side and back views from a single facial input, which then feed into a transformer-based reconstructor that produces a comprehensive 3D Gaussian Splats representation. Previous methods for monocular 3D face reconstruction often lack full view coverage or view consistency due to insufficient multi-view supervision. We address this by creating a high-quality synthetic head dataset that enables consistent supervision across viewpoints. To bridge the domain gap between synthetic training data and real-world images, we propose a simple yet effective technique that ensures the view generation process maintains fidelity to the input by learning to reconstruct the input image alongside the view generation. Despite being trained exclusively on synthetic data, our method demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that $\textit{FaceLift}$ outperforms state-of-the-art 3D face reconstruction methods on identity preservation, detail recovery and rendering quality.

Wed 22 Oct. 14:15 - 16:15 PDT

#254
DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

Youzhuo Wang · jiayi ye · Chuyang Xiao · Yiming Zhong · Heng Tao · Hang Yu · Yumeng Liu · Jingyi Yu · Yuexin Ma

Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects, and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. Additionally, to ensure natural and human-like dexterous motions, we utilize teleoperation for data collection, enabling the robot's movements to align with human behaviors and habits, which is a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advancements in human-to-robot handover research by offering a high-quality dataset, effective solutions, and comprehensive evaluation metrics.

Wed 22 Oct. 14:15 - 16:15 PDT

#255
Precise Action-to-Video Generation Through Visual Action Prompts

Yuang Wang · Chao Wen · Haoyu Guo · Sida Peng · Minghan Qin · Hujun Bao · Ruizhen Hu · Xiaowei Zhou

We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources -- human-object interactions (HOI) and dexterous robotic manipulation -- enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach.

Wed 22 Oct. 14:15 - 16:15 PDT

#256
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Yan Zhang · Yao Feng · Alpár Cseke · Nitin Saini · Nathan Bajandas · Nicolas Heron · Michael Black

To build the motor system of an interactive avatar, it is essential to develop a generative motion model that can, at a minimum, drive the body to move in 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied in the past, most methods can hardly be regarded as embodied intelligence due to their offline setting, slow speed, limited motion lengths, unnaturalness, and more. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, we let the model concentrate on learning motion dynamics from a large number of sub-second motion segments. In the adaptation phase, we propose a generic ControlNet-like adaptor and fine-tune it on semantic action generation and spatial target reaching. Experiments show that physics effects emerge in our results. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed methods outperform state-of-the-art baselines. Based on these advantages, we build a real-time character animation system in Unreal Engine, making these avatars "alive".

Wed 22 Oct. 14:15 - 16:15 PDT

#257
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Hengjia Li · Lifan Jiang · Xi Xiao · Tianyang Wang · Hongwei Yi · Boxi Wu · Deng Cai

Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
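The reward-difference optimization can be pictured as a DPO-style pairwise objective over the hybrid preference pairs; the sketch below uses the denoising errors of the preferred and rejected videos and omits reference-model terms, so it is a simplification for illustration rather than the method itself.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(err_win: torch.Tensor,
                             err_lose: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    # err_win / err_lose: per-sample denoising errors of the preferred
    # ("winner") and rejected ("loser") videos from an identity/dynamics-
    # ranked pair. Lower error on the winner relative to the loser is rewarded.
    margin = beta * (err_lose - err_win)       # positive when the winner fits better
    return -F.logsigmoid(margin).mean()
```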

Wed 22 Oct. 14:15 - 16:15 PDT

#258
Consistency Trajectory Matching for One-Step Generative Super-Resolution

Weiyi You · Mingyang Zhang · Leheng Zhang · Xingyu Zhou · Kexuan Shi · Shuhang Gu

Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to distill the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model to that of the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the need for a pre-trained diffusion model. To further enhance the performance and better leverage the ground truth during the training process, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution with our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed method attains comparable or even superior performance on both synthetic and real datasets while maintaining minimal inference latency.
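Consistency training along a PF-ODE can be sketched as enforcing that a student at time t and an EMA teacher at a nearby earlier time s map the same trajectory point to the same prediction; the linear noising and the model(x, t, lr) signature below are assumptions for illustration, not CTMSR's exact parameterization.

```python
import torch
import torch.nn.functional as F

def consistency_training_loss(model, ema_model, lr_img, hr_img,
                              t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # Two nearby points on the same (simplified) trajectory toward the HR image.
    noise = torch.randn_like(hr_img)
    x_t = hr_img + t.view(-1, 1, 1, 1) * noise     # later, noisier point
    x_s = hr_img + s.view(-1, 1, 1, 1) * noise     # earlier point, s < t
    pred_t = model(x_t, t, lr_img)                 # student prediction
    with torch.no_grad():
        pred_s = ema_model(x_s, s, lr_img)         # EMA teacher prediction
    return F.mse_loss(pred_t, pred_s)              # both should land on the same output
```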

Wed 22 Oct. 14:15 - 16:15 PDT

#259
Triplet Diffusion for Skeleton-Text Matching (TDSM)

In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between the two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages correct skeleton-text matches while pushing apart pairs from different action classes. Our TDSM significantly outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
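A triplet-style objective on the reverse diffusion step, in the spirit of the TD loss above, can be sketched as requiring the denoising error under the matching action text to be smaller, by a margin, than under a mismatched text; the denoiser signature and noising scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_diffusion_loss(denoiser, skel_feat, txt_pos, txt_neg,
                           t: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # Denoising the noised skeleton feature should be easier when conditioned
    # on the matching action text (txt_pos) than on a text from another class.
    noise = torch.randn_like(skel_feat)
    x_t = skel_feat + t.view(-1, 1) * noise
    err_pos = F.mse_loss(denoiser(x_t, t, txt_pos), noise, reduction='none').mean(dim=-1)
    err_neg = F.mse_loss(denoiser(x_t, t, txt_neg), noise, reduction='none').mean(dim=-1)
    return F.relu(err_pos - err_neg + margin).mean()
```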

Wed 22 Oct. 14:15 - 16:15 PDT

#260
Highlight
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Boyang Deng · Kyle Genova · Songyou Peng · Gordon Wetzstein · Noah Snavely · Leonidas Guibas · Thomas Funkhouser

We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties make prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. We therefore introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems, and we carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining," "overpass was painted blue," etc.).

Wed 22 Oct. 14:15 - 16:15 PDT

#261
Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training

Zhenghong Zhou · Jie An · Jiebo Luo

Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and may disrupt the model's distribution learned from the training data. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the distribution learned during pretraining. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model’s latent space, ensuring high-quality video generation. Latent-Reframe can be applied to both DiT- and UNet-based video diffusion models. Experimental results demonstrate that Latent-Reframe can achieve comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets. Please open video_results.html in supplementary material to view generated videos.

Wed 22 Oct. 14:15 - 16:15 PDT

#262
Neuromanifold-Regularized KANs for Shape-fair Feature Representations

Mazlum Arslan · Weihong Guo · Shuo Li

Traditional deep networks struggle to acquire shape-fair representations due to their high expressivity. Kolmogorov-Arnold Networks (KANs) are promising candidates as they learn nonlinearities directly, a property that makes them more adaptive. However, KANs perform suboptimally in terms of shape-fairness because of unconstrained nonlinearities, a limitation we demonstrate for the first time. On the other hand, shape-fair networks reside on a low-degree neuromanifold. Motivated by this, we investigate neuromanifold regularization of KANs to enable learning of shape-fair feature representations. The proposed method, NeuroManifold Regularized KANs (NMR-KAN), is a novel regularization that separately addresses failure modes in the acquisition of local and global shape cues. This is done by constraining the degree of the neuromanifolds of two jointly trained feature extractors. Additionally, we propose a novel Style Decorrelation Loss that promotes decorrelation of intermediate representations. Our experiments demonstrate that NMR-KAN improves shape bias over baseline convolutional KANs by 14.8\% while also providing robustness under image corruptions and adversarial attacks.

Wed 22 Oct. 14:15 - 16:15 PDT

#263
Highlight
Learning to Generalize without Bias for Open-Vocabulary Action Recognition

Yating Yu · Congqi Cao · Yifan Zhang · Yanning Zhang

Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce $\textbf{Open-MeDe}$, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective on generalization, Open-MeDe adopts a meta-learning approach to improve $\textbf{known-to-open generalization}$ and $\textbf{image-to-video debiasing}$ in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, being free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensembling over the optimization trajectory to obtain generic optimal parameters that achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#264
GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar

SeungJun Moon · Hah Min Lew · Seungeun Lee · Ji-Su Kang · Gyeong-Moon Park

Despite recent progress in 3D head avatar generation, balancing identity preservation, i.e., reconstruction, with novel poses and expressions, i.e., animation, remains a challenge. Existing methods struggle to adapt Gaussians to varying geometrical deviations across facial regions, resulting in suboptimal quality. To address this, we propose GeoAvatar, a framework for adaptive geometrical Gaussian Splatting. GeoAvatar leverages Adaptive Geometrical Initialization (AGI), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization. Then, based on mouth anatomy and dynamics, we introduce a novel mouth structure and a part-wise deformation strategy to enhance the animation fidelity of the mouth. Finally, we propose a regularization loss for precise rigging between Gaussians and 3DMM faces. Moreover, we release DynamicFace, a video dataset with highly expressive facial motions. Extensive experiments show the superiority of GeoAvatar compared to state-of-the-art methods in reconstruction and novel animation scenarios. The dataset and pre-trained models will be released after the review.

Wed 22 Oct. 14:15 - 16:15 PDT

#265
MotionFollower: Editing Video Motion via Score-Guided Diffusion

Shuyuan Tu · Qi Dai · Zihao Zhang · Sicheng Xie · Zhi-Qi Cheng · Chong Luo · Xintong Han · Zuxuan Wu · Yu-Gang Jiang

Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, we propose two signal controllers, one for poses and the other for appearances, both consisting of convolution blocks without heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture (a reconstruction and an editing branch), significantly enhancing the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers during the score estimation. The resulting gradients thus inject appropriate guidance to latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate MotionFollower's competitive motion editing ability qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower delivers superior motion editing performance and exclusively supports large camera movements. To the best of our knowledge, MotionFollower is the first diffusion model to explore score regularization in video editing.

Wed 22 Oct. 14:15 - 16:15 PDT

#266
InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild

Yiyi Ma · Yuanzhi Liang · Xiu Li · Chi Zhang · Xuelong Li

We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis.

Wed 22 Oct. 14:15 - 16:15 PDT

#267
MoFRR: Mixture of Diffusion Models for Face Retouching Restoration

Jiaxin Liu · Qichao Ying · Zhenxing Qian · Sheng Li · Runqi Zhang · Jian liu · Xinpeng Zhang

The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek's expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.

Wed 22 Oct. 14:15 - 16:15 PDT

#268
D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Chende Zheng · Ruiqi suo · Chenhao Lin · Zhengyu Zhao · Le Yang · Shuai Liu · Minghui Yang · Cong Wang · Chao Shen

The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, and derive Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages these second-order temporal discrepancies. We validate the superiority of D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39\% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness.
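
In plain terms, the second-order central difference is a discrete second derivative along the time axis. A minimal sketch of extracting such a statistic from a frame tensor is shown below; the summary statistic and any decision rule on top of it are assumed placeholders, not the paper's detector.

```python
import numpy as np

def second_order_feature(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) array in [0, 1].
    Returns a scalar summary of the second-order central difference
    x[t+1] - 2*x[t] + x[t-1], i.e. a discrete temporal acceleration."""
    d2 = frames[2:] - 2.0 * frames[1:-1] + frames[:-2]
    return float(np.mean(np.abs(d2)))

# toy usage: smooth motion vs. temporally noisy frames
t = np.linspace(0, 1, 16)[:, None, None, None]
smooth = np.tile(t, (1, 8, 8, 3))                    # linear ramp over time
noisy = smooth + 0.05 * np.random.rand(16, 8, 8, 3)  # added temporal jitter
print(second_order_feature(smooth), second_order_feature(noisy))
```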

Wed 22 Oct. 14:15 - 16:15 PDT

#269
Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Vlad Hosu · Lorenzo Agnolucci · Daisuke Iso · Dietmar Saupe

Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study with verified reliability. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. We will release the code, dataset, and pre-trained models upon acceptance.
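
The weak-labeling idea rests on how the intrinsic scale changes when an image is pre-downscaled. One plausible reading, assumed here purely for illustration and not taken from the paper, is that downscaling an annotated image by factor s leaves it needing proportionally less further downscaling, giving a weak label of min(1, IIS / s). A sketch under that assumption:

```python
import numpy as np
from PIL import Image

def weak_iis_labels(img: Image.Image, iis: float, scales=(0.9, 0.8, 0.7)):
    """Generate (downscaled image, weak IIS label) pairs from one annotated
    image, under the ASSUMED relation: downscaling by factor s rescales the
    intrinsic scale to iis / s (clipped to 1.0)."""
    pairs = []
    for s in scales:
        w, h = img.size
        small = img.resize((max(1, int(w * s)), max(1, int(h * s))),
                           Image.LANCZOS)
        pairs.append((small, min(1.0, iis / s)))
    return pairs

# toy usage with a synthetic image and an annotated IIS of 0.6
img = Image.fromarray((np.random.rand(64, 64, 3) * 255).astype("uint8"))
for small, label in weak_iis_labels(img, iis=0.6):
    print(small.size, round(label, 3))
```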

Wed 22 Oct. 14:15 - 16:15 PDT

#270
Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

Darshan Thaker · Abhishek Goyal · Rene Vidal

Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we propose a simple modification to existing diffusion-based restoration methods that exploits the frequency structure of the reverse diffusion process. Specifically, our approach, denoted as Frequency Guided Posterior Sampling (FGPS), introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We provide the first rigorous analysis of the approximation error of FGPS for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. On real-world data, we develop an adaptive curriculum for our method's frequency schedule based on the underlying data distribution. FGPS significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
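
The core mechanism, a low-pass filter on the measurements whose cutoff grows as the reverse diffusion proceeds, can be sketched as below. The linear cutoff schedule and circular frequency mask are assumptions for illustration, not the paper's adaptive curriculum.

```python
import numpy as np

def low_pass(measurement: np.ndarray, cutoff: float) -> np.ndarray:
    """Keep spatial frequencies below `cutoff` (fraction of Nyquist, in (0, 1])."""
    f = np.fft.fftshift(np.fft.fft2(measurement))
    h, w = measurement.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    f[radius > cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

def cutoff_schedule(step: int, num_steps: int,
                    start: float = 0.1, end: float = 1.0) -> float:
    """Linear schedule: start with low frequencies, end with the full spectrum."""
    return start + (end - start) * step / max(1, num_steps - 1)

# toy usage: progressively less-filtered versions of one measurement
y = np.random.rand(64, 64)
for t in range(0, 10, 3):
    y_t = low_pass(y, cutoff_schedule(t, 10))
    print(t, round(float(np.abs(y - y_t).mean()), 4))
```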

Wed 22 Oct. 14:15 - 16:15 PDT

#271
GAS: Generative Avatar Synthesis from a Single Image

Yixing Lu · Junting Dong · YoungJoong Kwon · Qin Zhao · Bo Dai · Fernando De la Torre

We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image, addressing the challenging problem of single-image avatar generation. While recent methods employ diffusion models conditioned on human templates like depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between sparse driving signals and the actual human subject, resulting in multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. The dense driving signal from the initial reconstructed human provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Additionally, we propose a unified framework that enables the generalization learned from novel pose synthesis on in-the-wild videos to naturally transfer to novel view synthesis. Our video-based diffusion model enhances disentangled synthesis with high-quality view-consistent renderings for novel views and realistic non-rigid deformations in novel pose animation. Results demonstrate the superior generalization ability of our method across in-domain and out-of-domain in-the-wild datasets.

Wed 22 Oct. 14:15 - 16:15 PDT

#272
Less Static, More Private: Towards Transferable Privacy-Preserving Action Recognition by Generative Decoupled Learning

Zhi-Wei Xia · Kun-Yu Lin · Yuan-Ming Li · Wei-Jin Huang · Xian-Tuo Tan · Wei-Shi Zheng

This work focuses on the task of privacy-preserving action recognition, which aims to protect individual privacy in action videos without compromising recognition performance. Despite recent advancements, existing privacy-preserving action recognition models still struggle with video domain shifts. To address this challenge, this work aims to develop transferable privacy-preserving action recognition models, by leveraging labeled videos from the source domain and unlabeled videos from the target domain. This work contributes a novel method named GenPriv, which improves the transferability of privacy-preserving models by generative decoupled learning. Inspired by the fact that privacy-sensitive information in action videos primarily comes from the static human appearances, our GenPriv decouples video features into static and dynamic aspects and then removes privacy-sensitive content from static action features. We propose a generative architecture named ST-VAE, complemented by Spatial Consistency and Temporal Alignment losses, to enhance decoupled learning. Experimental results on three benchmarks with diverse domain shifts demonstrate the effectiveness of our proposed GenPriv.

Wed 22 Oct. 14:15 - 16:15 PDT

#273
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

Xiao Li · Qi Chen · Xiulian Peng · Kai Yu · Xie Chen · Yan Lu

We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#274
MR-FIQA: Face Image Quality Assessment with Multi-Reference Representations from Synthetic Data Generation

Fu-Zhao Ou · Chongyi Li · Shiqi Wang · Sam Kwong

Recent advancements in Face Image Quality Assessment (FIQA) models trained on real large-scale face datasets are pivotal in guaranteeing precise face recognition in unrestricted scenarios. Regrettably, privacy concerns lead to the discontinuation of real datasets, underscoring the pressing need for a tailored synthetic dataset dedicated to the FIQA task. However, creating satisfactory synthetic datasets for FIQA is challenging. It requires not only controlling the intra-class degradation of different quality factors (e.g., pose, blur, occlusion) for the pseudo-identity generation but also designing an optimized quality characterization method for quality annotations. This paper undertakes the pioneering initiative to establish a Synthetic dataset for FIQA (SynFIQA) based on a hypothesis: accurate quality labeling can be achieved through the utilization of quality priors across the diverse domains involved in quality-controllable generation. To validate this, we tailor the generation of reference and degraded samples by aligning pseudo-identity image features in stable diffusion latent space, editing 3D facial parameters, and customizing dual text prompts and post-processing. Furthermore, we propose a novel quality characterization method that thoroughly examines the relationship of Multiple Reference representations among recognition embedding, spatial, and visual-language domains to acquire annotations essential for fitting FIQA models (MR-FIQA). Extensive experiments confirm the validity of our hypothesis and demonstrate the advantages of our SynFIQA data and MR-FIQA method. Our dataset, source code, and models will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#275
Text-to-Any-Skeleton Motion Generation Without Retargeting

Qingyuan Liu · Ke Lv · Kun Dong · Jian Xue · Zehai Niu · Jinbao Wang

Recent advances in text-driven motion generation have shown notable advancements. However, these works are typically limited to standardized skeletons and rely on a cumbersome retargeting process to adapt to varying skeletal configurations of diverse characters. In this paper, we present OmniSkel, a novel framework that can directly generate high-quality human motions for any user-defined skeleton without retargeting. Specifically, we introduce skeleton-aware RVQ-VAE, which utilizes Kinematic Graph Cross Attention (K-GCA) to effectively integrate skeletal information into the motion encoding and reconstruction. Moreover, we propose a simple yet effective training-free approach, Motion Restoration Optimizer (MRO), to ensure zero bone length error while preserving motion smoothness. To facilitate our research, we construct SkeleMotion-3D, a large-scale text-skeleton-motion dataset based on HumanML3D. Extensive experiments demonstrate the excellent robustness and generalization of our method. The dataset and source code will be made public upon acceptance of this paper.

Wed 22 Oct. 14:15 - 16:15 PDT

#276
Blind2Sound: Self-Supervised Image Denoising without Residual Noise

Jiazheng Liu · Zejin Wang · Bohao Chen · Hua Han

Self-supervised blind denoising for Poisson-Gaussian noise remains a challenging task. Pseudo-supervised pairs constructed from single noisy images re-corrupt the signal and degrade performance. Visible blind spots address the information loss in masked inputs. However, without explicit noise sensing, a mean-squared-error objective cannot adjust the denoising intensity to dynamic noise levels, leading to noticeable residual noise. In this paper, we propose Blind2Sound, a simple yet effective approach to overcome residual noise in denoised images. The proposed adaptive re-visible loss senses noise levels and performs personalized denoising without noise residues while keeping the signal lossless. The theoretical analysis of intermediate medium gradients guarantees stable training, while the Cramer Gaussian loss acts as a regularization to facilitate the accurate perception of noise levels and improve the performance of the denoiser. Experiments on synthetic and real-world datasets show the superior performance of our method, especially for single-channel images. The code is publicly available from this link.

Wed 22 Oct. 14:15 - 16:15 PDT

#277
STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

Xiaohang Yang · Qing Wang · Jiahao Yang · Gregory Slabaugh · Shanxin Yuan

Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration, while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless \textbf{S}patial-\textbf{T}emporal \textbf{a}ware motion \textbf{R}etargeting (\textbf{STaR}), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code and model will be released upon acceptance.
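
The multi-level trajectory smoothness constraint is not spelled out in the abstract. A common way to realize such a temporal consistency term, penalizing first- and second-order joint-position differences over the predicted sequence, is sketched below as an assumption, not the paper's exact loss.

```python
import torch

def trajectory_smoothness_loss(joints: torch.Tensor,
                               w_vel: float = 1.0, w_acc: float = 1.0):
    """joints: (B, T, J, 3) predicted joint positions.
    Penalizes large frame-to-frame velocity and acceleration (motion jitter)."""
    vel = joints[:, 1:] - joints[:, :-1]
    acc = vel[:, 1:] - vel[:, :-1]
    return w_vel * vel.pow(2).mean() + w_acc * acc.pow(2).mean()

# toy usage on a random motion sequence (batch 2, 30 frames, 22 joints)
motion = torch.randn(2, 30, 22, 3)
print(trajectory_smoothness_loss(motion))
```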

Wed 22 Oct. 14:15 - 16:15 PDT

#278
ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Ying Guo · Xi Liu · Cheng Zhen · Pengfei Yan · Xiaoming Wei

Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, the previous clip-wise generation paradigm and explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging to achieve real-time and realistic generation. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

Wed 22 Oct. 14:15 - 16:15 PDT

#279
Towards a Universal Image Degradation Model via Content-Degradation Disentanglement

Wenbo Yang · Zhongling Wang · Zhou Wang

Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific or a narrow set of degradations, which often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model’s accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video (an anonymized version is available in the supplementary material), code, and dataset of this project will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#280
Highlight
Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang · Yicheng Feng · Hao Luo · Yijiang Li · Zihao Yue · Sipeng Zheng · Zongqing Lu

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.

Wed 22 Oct. 14:15 - 16:15 PDT

#281
IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

Chen Li · Chinthani Sugandhika · Ee Yeo Keat · Eric Peh · Hao Zhang · HONG YANG · Deepu Rajan · Basura Fernando

Existing human motion Q\&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q\&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets.

Wed 22 Oct. 14:15 - 16:15 PDT

#282
AdaDCP: Learning an Adapter with Discrete Cosine Prior for Clear-to-Adverse Domain Generalization

Qi Bi · Yixian Shen · Jingjun Yi · Gui-Song Xia

Vision Foundation Models (VFMs) provide an inherent generalization ability to unseen domains for downstream tasks. However, fine-tuning a VFM to parse various adverse scenes (\eg, fog, snow, night) is particularly challenging, as such samples are difficult to collect and annotate. Using easy-to-acquire clear scenes as the source domain is a feasible solution, but a huge domain gap exists between clear and adverse scenes due to their dramatically different appearance. In this paper, we propose \texttt{AdaDCP} to effectively fine-tune a VFM for adverse scene segmentation by generalizing only from a clear source domain. Interestingly, after the discrete cosine transform, the frequency bands of a VFM exhibit either variant or invariant properties across various adverse weather conditions. Therefore, our \texttt{AdaDCP} is empowered by three key components: (1) weather-invariant band adaptation, which provides a foundation for enhancing robustness to adverse scenes; (2) weather-variant band adaptation, which perceives the weather-specific information of each type of adverse scene; (3) weather-invariant band alignment, which implicitly enforces the weather-variant bands to progressively incorporate weather-invariant information, thereby mitigating the clear-to-adverse domain gap. Experiments conducted on eight unseen adverse scene segmentation datasets show its state-of-the-art performance.
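
The band-wise treatment rests on a discrete cosine transform of feature maps followed by a split into low-frequency and high-frequency bands. A minimal sketch of such a split is shown below; the square low-frequency corner and the 25% ratio are assumptions for illustration, not the paper's band selection.

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_frequency_bands(feat: np.ndarray, low_ratio: float = 0.25):
    """feat: (C, H, W) feature map. Returns (low_band, high_band) feature maps
    reconstructed from the lowest `low_ratio` fraction of DCT coefficients and
    from the remaining coefficients, respectively."""
    coeffs = dctn(feat, axes=(-2, -1), norm="ortho")
    c, h, w = feat.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[: int(h * low_ratio), : int(w * low_ratio)] = True
    low = idctn(np.where(mask, coeffs, 0.0), axes=(-2, -1), norm="ortho")
    high = idctn(np.where(mask, 0.0, coeffs), axes=(-2, -1), norm="ortho")
    return low, high

# toy usage: the two bands sum back to the input (DCT is linear)
feat = np.random.rand(8, 32, 32)
low, high = split_frequency_bands(feat)
print(np.allclose(low + high, feat))
```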

Wed 22 Oct. 14:15 - 16:15 PDT

#283
MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration

Zhehui Wu · Yong Chen · Naoto Yokoya · Wei He

Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks.

Capturing the spatial patterns of neurons and generating high-fidelity morphological data remain critical challenges in developing biologically realistic large-scale brain network models. Existing methods fail to reconcile anatomical complexity with diversity and computational scalability. We propose MorphoGen, a hierarchical framework integrating global structure prediction through denoising diffusion probabilistic models (DDPMs) with local neurite optimization. The pipeline initiates with DDPM-generated coarse-grained neuronal point clouds, followed by skeletonization and growth-guided linking to derive plausible tree-like structures, and culminates in natural neural fiber refinement via a pragmatic smoothing network. Comprehensive evaluations across three distinct long-range projection neuron datasets demonstrate that the proposed method improves 1-Nearest Neighbor Accuracy by approximately 12\% on average compared to the state-of-the-art baseline, reduces average training time by around 55\%, and aligns the distributions of several morphometrics with real data. This work establishes a novel global-to-local paradigm for neuronal morphology generation, offering a more direct and efficient approach compared to current branch-sequential modeling methods. Code is provided in the supplementary materials and will be publicly available upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#285
Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding

Xiaojie Zhang · Yuanfei Wang · Ruihai Wu · Kunqi Xu · Yu Li · Liuyu Xiang · Hao Dong · Zhaofeng He

Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose \textbf{AdaRPG}, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG’s strong generalization ability across novel articulated object categories.

Wed 22 Oct. 14:15 - 16:15 PDT

#286
Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

Yuang Feng · Shuyong Gao · Fuzhen Yan · Yicheng Song · Lingyi Hong · Junjie Hu · Wenqiang Zhang

Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the \textbf{Scoring, Remember, and Reference (SRR)} framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10\% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#287
Stylized-Face: A Million-level Stylized Face Dataset for Face Recognition

Zhengyuan Peng · Jianqing Xu · Yuge Huang · Jinkun Hao · Shouhong Ding · zhizhong zhang · Xin TAN · Lizhuang Ma

Stylized face recognition is the task of recognizing generated faces with the same ID across diverse stylistic domains (e.g., anime, painting, cyberpunk styles). This emerging field plays a vital role in the governance of generative images, serving a primary objective: recognizing the ID information of stylized faces to detect potential infringements of portrait rights. Despite its importance, progress in stylized face recognition has been hindered by the lack of large-scale, stylistically diverse datasets. To address this gap, we introduce the \textbf{Stylized-Face} dataset, which is the first dataset specifically designed for stylized face recognition. The Stylized-Face dataset includes 4.6 million images across 62k IDs, specifically curated to enhance model performance in stylized face recognition tasks. To ensure data quality (i.e., ID preservation) at this massive scale, we implement a semi-automated pipeline for large-scale data cleaning. Based on the Stylized-Face dataset, we establish three benchmarks to evaluate the robustness and generalization of recognition models across various scenarios, including within-distribution performance, cross-prompt generalization, and cross-method generalization, which target key challenges in stylized face recognition. Experimental results demonstrate that models trained on Stylized-Face achieve remarkable improvements in both stylized face recognition performance (a 15.9% improvement in TAR at FAR=1e-4) and generalization (a 13.3% improvement in TAR at FAR=1e-3 in cross-method generalization).
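
The reported gains are in TAR at a fixed FAR. For readers unfamiliar with the metric, a minimal sketch of computing TAR@FAR from genuine and impostor similarity scores is given below; the score distributions here are synthetic.

```python
import numpy as np

def tar_at_far(genuine: np.ndarray, impostor: np.ndarray, far: float) -> float:
    """Pick the threshold whose false accept rate on impostor scores equals `far`,
    then report the true accept rate of genuine scores at that threshold."""
    thr = np.quantile(impostor, 1.0 - far)     # top `far` fraction of impostors accepted
    return float(np.mean(genuine >= thr))

# toy usage with synthetic cosine-similarity scores
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 100_000)
impostor = rng.normal(0.2, 0.1, 1_000_000)
print(tar_at_far(genuine, impostor, far=1e-4))
```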

Wed 22 Oct. 14:15 - 16:15 PDT

#288
GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars

Shivangi Aneja · Artem Sevastopolsky · Tobias Kirschstein · Justus Thies · Angela Dai · Matthias Nießner

We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photorealistic and personalized multi-view consistent 3D human head avatars from spoken audio at real-time rendering rates. To capture the expressive and detailed nature of human heads, including skin furrowing and fine facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create photorealistic and temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize dynamic facial details at real-time rendering rates. Next, we devise an audio-conditioned transformer model to extract lip and wrinkle features from the audio input and combine them with our 3D avatar by performing joint 3D sequence refinement to synthesize photorealistic animations. To the best of our knowledge, this is the first work for generating photorealistic multi-view 3D head avatar sequences only from spoken audio, representing a significant advancement in the field of audio-driven 3D facial animation. In the absence of a high-quality multi-view talking-face dataset, we captured a new large-scale multi-view dataset of audio-visual sequences of native English speakers and diverse facial geometry. GaussianSpeech achieves state-of-the-art quality consistent with the avatar's speaking style.

Wed 22 Oct. 14:15 - 16:15 PDT

#289
A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

Jie Zhu · Yiyang Su · Minchul Kim · Anil Jain · Xiaoming Liu

Whole-body biometric recognition is a challenging multi-modal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present $\textbf{Q}$uality-guided $\textbf{M}$ixture of score-fusion $\textbf{E}$xperts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in multi-modal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality. Code will be publicly released upon publication.
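
The core idea, fusing per-modality similarity matrices with weights driven by estimated sample quality rather than a fixed average, can be sketched as follows. The softmax gating over quality scores is an illustrative stand-in for the learned mixture of experts, not the paper's architecture.

```python
import numpy as np

def quality_weighted_fusion(sim_mats: list[np.ndarray],
                            qualities: np.ndarray) -> np.ndarray:
    """sim_mats: list of (N_probe, N_gallery) similarity matrices, one per
    modality/model. qualities: (N_probe, M) per-probe quality estimates.
    Returns a fused (N_probe, N_gallery) similarity matrix."""
    q = np.exp(qualities - qualities.max(axis=1, keepdims=True))
    weights = q / q.sum(axis=1, keepdims=True)     # softmax over modalities
    stacked = np.stack(sim_mats, axis=-1)          # (N_probe, N_gallery, M)
    return (stacked * weights[:, None, :]).sum(axis=-1)

# toy usage: face / gait / body similarity matrices for 5 probes, 10 gallery IDs
rng = np.random.default_rng(1)
sims = [rng.random((5, 10)) for _ in range(3)]
quality = rng.random((5, 3))
print(quality_weighted_fusion(sims, quality).shape)
```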

Wed 22 Oct. 14:15 - 16:15 PDT

#290
VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Xinran Ling · Chen Zhu · Meiqi Wu · Hangyu Li · Xiaokun Feng · Cundian Yang · Aiming Hao · Jiashu Zhu · Jiahong Wu · Xiangxiang Chu

Video generation has advanced rapidly, driving improvements in evaluation methods, yet assessing the motion in generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perception; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench—a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3\% improvement in Spearman’s correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench as an open-source benchmark, setting a new standard for evaluating and advancing motion generation models.
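
The human-aligned validation boils down to rank correlation between a metric's scores and human preference annotations. A minimal check of that correlation looks like this; the scores here are synthetic.

```python
import numpy as np
from scipy.stats import spearmanr

# synthetic example: metric scores vs. human preference ratings for 50 videos
rng = np.random.default_rng(0)
human = rng.random(50)
metric = human + 0.2 * rng.standard_normal(50)   # noisy but correlated metric
rho, p_value = spearmanr(metric, human)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.2g})")
```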

Wed 22 Oct. 14:15 - 16:15 PDT

#291
Capturing head avatar with hand contacts from a monocular video

Haonan He · Yufeng Zheng · Jie Song

Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine a depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#292
Highlight
Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

Elena Buglakova · Anwai Archit · Edoardo D'Imprima · Julia Mahamid · Constantin Pape · Anna Kreshuk

Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
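
For readers who have not hit the issue, the standard sliding-window pipeline the paper analyzes looks roughly like the sketch below; whether the stitched prediction is seamless then depends on the normalization layers inside `model`, which is exactly the trade-off the paper studies. The overlap-and-average blending here is one common choice, not the authors' specific pipeline.

```python
import numpy as np

def sliding_window_inference(image, model, tile=256, overlap=32):
    """image: (H, W, C) with H, W >= tile. model: callable mapping a tile to
    per-pixel scores. Overlapping tiles are averaged; tiling artifacts appear
    when `model`'s normalization statistics differ from tile to tile."""
    h, w, _ = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    step = tile - overlap
    for y in range(0, max(1, h - overlap), step):
        for x in range(0, max(1, w - overlap), step):
            y0, x0 = min(y, h - tile), min(x, w - tile)   # clamp last tiles
            pred = model(image[y0:y0 + tile, x0:x0 + tile])
            out[y0:y0 + tile, x0:x0 + tile] += pred
            count[y0:y0 + tile, x0:x0 + tile] += 1.0
    return out / count

# toy usage: a "model" that normalizes by the tile mean (a source of artifacts)
img = np.random.rand(512, 512, 3)
tile_mean_model = lambda t: t.mean(axis=-1) - t.mean()
print(sliding_window_inference(img, tile_mean_model).shape)
```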

Wed 22 Oct. 14:15 - 16:15 PDT

#293
BlueNeg: A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration

Hanyuan Liu · Chengze Li · Minshan Xie · Wang Zhenni · Jiawen Liang · Chi LEUNG · Tien-Tsin Wong

While digitally acquired photographs have been dominating since around 2000, there remains a huge amount of legacy photographs being acquired by optical cameras and are stored in the form of negative films. In this paper, we focus on the unique phenomenon of deterioration on negative films and propose the first high-quality 35mm negative film dataset BlueNeg for restoring channel-heterogeneous deterioration. We would like to bring attention to this under-explored research area of image restoration on channel-heterogeneous deterioration. However, a large portion of the collected negative films are already contaminated, so we do not have non-corrupted version or the ground truth of these photos, which poses a challenge in evaluating the restoration performance. To address this, we leverage the printed photos from the same negative films, which do not suffer from the channel-heterogeneous deterioration, for quantitative evaluation. We propose a reverse-developing process to generate the estimated ground truth from the printed photos and design an evaluation protocol for evaluating the restoration performance. With the collected data and the proposed evaluation protocol, we find existing image restoration methods cannot perform well on our dataset, requiring specially designed tools for better restoration. We hope that our dataset and benchmark will inspire future research in this area, especially in the context of legacy photograph restoration for preserving historical moments and archival purposes. Our dataset will be publicly available at HuggingFace Hub under a derivative license based on CC-BY.

Wed 22 Oct. 14:15 - 16:15 PDT

#294
GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Junyu Shi · Lijiang LIU · Yong Sun · Zhiyuan Zhang · JINNI ZHOU · Qiang Nie

Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#295
Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control

Seongmin Park · Hyungmin Kim · Sangwoo kim · Wonseok Jeon · Juyoung Yang · Byeongwook Jeon · Yoonseon Oh · Jungwook Choi

Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning, which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, our method preserves decision fidelity under low-bit precision. We validate its generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5$\times$ speedup and 2.5$\times$ energy savings on an edge GPU with minimal accuracy loss. These results underline our method's potential for efficiently deploying large IL-based policy models on resource-limited devices.
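
The selective loss-weighting idea, up-weighting mission-critical states identified by saliency scores during quantization-aware training, can be sketched as below. The normalization and weighting scheme is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def saliency_weighted_loss(per_sample_loss: torch.Tensor,
                           saliency: torch.Tensor,
                           alpha: float = 2.0) -> torch.Tensor:
    """per_sample_loss: (B,) imitation losses (e.g. action MSE) for a batch.
    saliency: (B,) non-negative scores marking mission-critical states.
    The most salient states get up to (1 + alpha) times the base weight."""
    s = saliency / (saliency.max() + 1e-8)   # normalize to [0, 1]
    weights = 1.0 + alpha * s
    return (weights * per_sample_loss).sum() / weights.sum()

# toy usage
losses = torch.rand(16)
saliency = torch.rand(16)
print(saliency_weighted_loss(losses, saliency))
```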

Wed 22 Oct. 14:15 - 16:15 PDT

#296
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

Sungwoo Cho · Jeongsoo Choi · Sungnyun Kim · Se-Young Yun

Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating detailed speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.

Wed 22 Oct. 14:15 - 16:15 PDT

#297
Privacy-centric Deep Motion Retargeting for Anonymization of Skeleton-Based Motion Visualization

Thomas Carr · Depeng Xu · Shuhan Yuan · Aidong Lu

Capturing and visualizing motion using skeleton-based techniques is a key aspect of computer vision, particularly in virtual reality (VR) settings. Its popularity has surged, driven by the simplicity of obtaining skeleton data and the growing appetite for virtual interaction. Although this skeleton data appears to be non-identifiable, it can be exploited to derive personally identifiable information (PII), posing a risk of inadvertent privacy breaches. In this paper, we explore the application of motion retargeting and its ability to mitigate privacy leakage. Motion retargeting can effectively transfer the motion from an initial user onto a dummy skeleton with the purpose of hiding PII. We propose a Privacy-centric Deep Motion Retargeting model (PMR), which mitigates PII leakage through adversarial learning. In our evaluation, our proposed model achieves motion retargeting performance on par with the current state-of-the-art models. More importantly, it effectively prevents attackers from identifying the initial user.

Wed 22 Oct. 14:15 - 16:15 PDT

#298
Highlight
ChartCap: Mitigating Hallucination of Dense Chart Captioning

Junyoung Lim · Jaewoo Ahn · Gunhee Kim

Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and a failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernible data from the chart and employ a cycle-consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirm that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing not only open-source and proprietary models but even human-annotated captions.
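
The Visual Consistency Score can be pictured as in the sketch below: a caption is turned back into a chart and compared with the original in an image-embedding space. The `text_to_chart` and `image_encoder` callables are hypothetical placeholders; the paper's actual regeneration pipeline and similarity measure may differ.

```python
import torch
import torch.nn.functional as F

def visual_consistency_score(caption: str, original_chart: torch.Tensor,
                             text_to_chart, image_encoder) -> float:
    """original_chart: (C, H, W) image tensor; returns a reference-free score."""
    regenerated = text_to_chart(caption)              # e.g. code generation + plot render
    e_orig = image_encoder(original_chart.unsqueeze(0))
    e_regen = image_encoder(regenerated.unsqueeze(0))
    return F.cosine_similarity(e_orig, e_regen, dim=-1).item()
```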

Wed 22 Oct. 14:15 - 16:15 PDT

#299
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning

Kelin Yu · Sheng Zhang · Harshit Soora · Furong Huang · Heng Huang · Pratap Tokekar · Ruohan Gao

Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GenFlowRL, which derives shaped rewards from object-centric flow produced by generative models trained on easy-to-collect cross-embodiment datasets. This enables learning generalizable and robust policies from expert demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios.
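
One way to picture the shaped reward described above is as an agreement score between the object-centric flow observed in a rollout and the flow proposed by the generative model; the exponential kernel and keypoint-track format below are illustrative assumptions, not the authors' reward.

```python
import numpy as np

def flow_shaped_reward(observed_flow: np.ndarray,
                       generated_flow: np.ndarray,
                       scale: float = 1.0) -> float:
    """Both flows: (T, K, 2) arrays of K object keypoints tracked over T steps."""
    T = min(len(observed_flow), len(generated_flow))
    dist = np.linalg.norm(observed_flow[:T] - generated_flow[:T], axis=-1).mean()
    return float(np.exp(-scale * dist))   # in (0, 1]; higher when the flows agree
```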

Wed 22 Oct. 14:15 - 16:15 PDT

#300
Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

Congyi Fan · Jian Guan · Xuanjia Zhao · Dongli Xu · Youtian Lin · Tong Ye · Pengming Feng · Haiwei Pan

Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for the virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages a gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce a Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity.

Wed 22 Oct. 14:15 - 16:15 PDT

#301
GMMamba: Group Masking Mamba for Whole Slide Image Classification

Tingting Zheng · Hongxun Yao · Kui Jiang · Yi Xiao · Sicheng Zhao

Recent advances in selective state space models (Mamba) have shown great promise in whole slide image (WSI) classification. Despite this, WSIs contain explicit local redundancy (similar patches) and irrelevant regions (uninformative instances), posing significant challenges for Mamba-based multi-instance learning (MIL) methods in capturing global representations. Furthermore, bag-level approaches struggle to extract critical features from all instances, while group-level methods fail to adequately account for tumor dispersion and intrinsic correlations across groups, leading to suboptimal global representations. To address these issues, we propose group masking Mamba (GMMamba), a novel framework that combines two elaborate modules: (1) intra-group masking Mamba (IMM) for selective instance exploration within groups, and (2) cross-group super-feature sampling (CSS) to ameliorate long-range relation learning. Specifically, IMM adaptively predicts sparse masks to filter out features with low attention scores (i.e., uninformative patterns) during bidirectional Mamba modeling, facilitating the removal of instance redundancies for compact local representation. For improved bag prediction, the CSS module further aggregates sparse group representations into discriminative features, effectively grasping comprehensive dependencies among dispersed and sparse tumor regions inherent in large-scale WSIs. Extensive experiments on four datasets demonstrate that GMMamba outperforms the state-of-the-art ACMIL by 2.2\% and 6.4\% in accuracy on the TCGA-BRCA and TCGA-ESCA datasets, respectively.
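
The intra-group masking idea can be sketched as a simple score-based filter over instances before sequence modeling, as below; GMMamba predicts its sparse masks adaptively during bidirectional Mamba modeling, so the fixed keep-ratio here is only an illustrative stand-in.

```python
import torch

def select_group_instances(instance_feats: torch.Tensor,
                           attn_scores: torch.Tensor,
                           keep_ratio: float = 0.5) -> torch.Tensor:
    """instance_feats: (N, D); attn_scores: (N,). Keeps the top-scoring instances."""
    k = max(1, int(keep_ratio * attn_scores.numel()))
    keep_idx = torch.topk(attn_scores, k).indices
    return instance_feats[keep_idx]   # compact local representation of the group
```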

Wed 22 Oct. 14:15 - 16:15 PDT

#302
Understanding Co-speech Gestures in-the-wild

Sindhu Hegde · K R Prajwal · Taein Kwon · Andrew Zisserman

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. All code, models, and data annotations will be released to support future research.
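
A global phrase contrastive objective of the kind mentioned in the abstract can be written as a symmetric InfoNCE loss between gesture-video embeddings and phrase (speech or text) embeddings; the batch-pairing assumption and temperature below are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(gesture_emb: torch.Tensor, speech_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """gesture_emb, speech_emb: (B, D) embeddings of paired clips and phrases."""
    g = F.normalize(gesture_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(g.shape[0], device=g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```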

Wed 22 Oct. 14:15 - 16:15 PDT

#303
Motion Synthesis with Sparse and Flexible Keyjoint Control

Inwoo Hwang · Jinseok Bae · Donggeun Lim · Young Min Kim

Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector positions for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. The shared second stage then synthesizes a natural whole-body motion that precisely satisfies the task requirements from the dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#304
Highlight
UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

Yan Wu · Korrawe Karunratanakul · Zhengyi Luo · Siyu Tang

Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#305
MMAD: Multi-label Micro-Action Detection in Videos

Kun Li · pengyu Liu · Dan Guo · Fei Wang · zhiliang wu · Hehe Fan · Meng Wang

Human body actions are an important form of non-verbal communication in social interactions. This paper specifically focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding.

Wed 22 Oct. 14:15 - 16:15 PDT

#306
UniRes: Universal Image Restoration for Complex Degradations

Mo Zhou · Keren Ye · Mauricio Delbracio · Peyman Milanfar · Vishal Patel · Hossein Talebi

Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements by simulating those degradations and leveraging image generative priors; however, generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which are frequently seen in the wild. A simple yet flexible diffusion-based framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible, as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gains, especially for images with complex degradations.

Wed 22 Oct. 14:15 - 16:15 PDT

#307
SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Chun-Han Yao · Yiming Xie · Vikram Voleti · Huaizu Jiang · Varun Jampani

We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency of reference multi-views and designing blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14\% LPIPS) and 4D consistency (-44\% FV4D) in novel-view video synthesis and 4D optimization (-12\% LPIPS and -24\% FV4D) compared to SV4D.

Wed 22 Oct. 14:15 - 16:15 PDT

#308
Gait-X: Exploring X modality for Generalized Gait Recognition

Zengbin Wang · Saihui Hou · Junjie Li · Xu Liu · Chunshui Cao · Yongzhen Huang · Siye Wang · Man Zhang

Modality exploration in gait recognition has been repeatedly identified as a core research topic, evolving from binary silhouettes to promising modalities like parsing, mesh, and point clouds. Work on these newer modalities agrees that the silhouette is less affected by background and clothing noise, but argues that it loses too much valuable discriminative information. It seeks to retain the strengths of the silhouette while extracting more semantic or structural information through upstream estimation for better recognition. We agree with this principle but argue that these upstream estimations are usually unstable and the resulting modalities rely on pre-defined designs. Moreover, the crucial aspect of modality generalization remains underexplored. To address this, inspired by the stability and high-dimensional analysis of frequency decomposition, we propose Gait-X to explore how to flexibly and stably develop a gait-specific generalized X modality from a frequency perspective. Specifically, 1) we replace upstream estimation with stable frequency decomposition and conduct a comprehensive analysis of how different frequencies impact the modality and within-/cross-domain performance; 2) to enable flexible modality customization and mitigate the influence of noise and domain variations, we propose to remove irrelevant low-frequency noise and suppress high-frequency domain-specific information to form our X modality; 3) to further improve model generalization, we expand the representation across multiple frequencies to guide the model in balancing all frequencies for enhanced generalization. Extensive experiments on the CCPG, SUSTech1K, and CASIA-B datasets show superior within- and cross-domain performance.
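
The frequency-domain construction of an X modality can be sketched as a band filter on a silhouette: drop the lowest frequencies and damp the highest ones. The cutoff radii and suppression factor below are assumptions; the paper analyzes and balances frequency bands far more carefully.

```python
import numpy as np

def x_modality(silhouette: np.ndarray, low_cut: int = 2,
               high_cut: int = 32, high_suppress: float = 0.2) -> np.ndarray:
    """silhouette: (H, W) grayscale image -> band-filtered representation."""
    spec = np.fft.fftshift(np.fft.fft2(silhouette))
    h, w = silhouette.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)       # radial frequency per coefficient
    gain = np.ones_like(r)
    gain[r < low_cut] = 0.0                    # remove low-frequency noise
    gain[r > high_cut] = high_suppress         # damp domain-specific high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * gain)))
```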

Wed 22 Oct. 14:15 - 16:15 PDT

#309
MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost

Taiga Yamane · Ryo Masumura · Satoshi Suzuki · Shota Orihashi

Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird's eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current timestamp and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces a trajectory motion cost and a trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs measure how likely pedestrians at the current timestamp and each past timestamp are to be identical, based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that pedestrian with the correct past trajectory based on the other past timestamps. In addition, MVTrajecter effectively captures the relationships among multiple timestamps by leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms previous state-of-the-art methods.
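
The two trajectory costs can be pictured as an aggregated cost matrix between current detections and stored trajectories over several past timestamps, as in the sketch below; MVTrajecter computes and weighs these costs with learned attention, so the plain Euclidean and cosine terms here are only illustrative.

```python
import numpy as np

def association_cost(cur_pos, cur_app, past_pos, past_app,
                     w_motion: float = 1.0, w_app: float = 1.0) -> np.ndarray:
    """
    cur_pos: (N, 2) current BEV positions; cur_app: (N, D) appearance embeddings.
    past_pos: (T, M, 2) trajectory positions; past_app: (T, M, D) embeddings.
    Returns an (N, M) cost matrix aggregated over all T past timestamps.
    """
    motion = np.linalg.norm(cur_pos[None, :, None] - past_pos[:, None], axis=-1)  # (T, N, M)
    cur_n = cur_app / np.linalg.norm(cur_app, axis=-1, keepdims=True)
    past_n = past_app / np.linalg.norm(past_app, axis=-1, keepdims=True)
    appearance = 1.0 - np.einsum('nd,tmd->tnm', cur_n, past_n)                    # (T, N, M)
    return (w_motion * motion + w_app * appearance).mean(axis=0)                  # (N, M)
```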

Wed 22 Oct. 14:15 - 16:15 PDT

#310
Highlight
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

Youliang Zhang · Ronghui Li · Yachao Zhang · Liang Pan · Jingbo Wang · Yebin Liu · Xiu Li

Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to flawed motion clips in video-based motion capture results and the inherent complexity of modeling high-difficulty motions. Therefore, leveraging the strength of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that uses motion context and video masks to repair flawed motions, and propose a physics-based motion transfer module (PTM), which employs a prior-injected pretrain-and-adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine video motion capture, and it also excels in motion generation tasks. Finally, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public datasets. Our project page is: https://physicalmotionrestoration.github.io/

Wed 22 Oct. 14:15 - 16:15 PDT

#311
Highlight
GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

Xiaobao Wei · Peng Chen · Guangyu Li · Ming Lu · Hui Chen · Feng Tian

Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use neural radiance fields (NeRF) to generate augmented data. However, existing methods based on NeRF are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined in head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, the first high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. Leveraging the unstructured nature of 3DGS, we develop a novel representation of the eye for rigid eye rotation based on the target gaze direction. To enable synthesis generalization across various subjects, we integrate an expression-guided module to inject subject-specific information into the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. The code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#312
Highlight
MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Zijian Dong · Longteng Duan · Jie Song · Michael Black · Andreas Geiger

We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.

Wed 22 Oct. 14:15 - 16:15 PDT

#313
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Yujie Zhou · Jiazi Bu · Pengyang Ling · Pan Zhang · Tong Wu · Qidong Huang · Jinsong Li · Xiaoyi Dong · Yuhang Zang · Yuhang Cao · Anyi Rao · Jiaqi Wang · Li Niu

Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers of the image relighting model to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted videos while maintaining the relighted image quality, ensuring coherent lighting transitions across frames.
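
The linear-blending step behind Progressive Light Fusion can be sketched as a per-step convex combination of the source appearance and the frame-wise relighted appearance, with the relighting weight ramped up progressively; the linear schedule below is an assumption for illustration.

```python
import torch

def progressive_light_fusion(source_frames: torch.Tensor,
                             relit_frames: torch.Tensor,
                             num_steps: int):
    """Both inputs: (T, C, H, W). Yields one blended fusion target per step."""
    for step in range(num_steps):
        w = (step + 1) / num_steps               # progressively trust the relit appearance
        yield (1.0 - w) * source_frames + w * relit_frames
```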

Wed 22 Oct. 14:15 - 16:15 PDT

#314
SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

Heyi Sun · Cong Wang · Tian-Xing Xu · Jingwei Huang · Di Kang · Chunchao Guo · Song-Hai Zhang

Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.

Wed 22 Oct. 14:15 - 16:15 PDT

#315
Highlight
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Ke Fan · Shunlin Lu · Minyue Dai · Runyi Yu · Lixing Xiao · Zhiyang Dou · Junting Dong · Lizhuang Ma · Jingbo Wang

Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, namely, achieving zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#316
StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion

Ziyu Guo · Young-Yoon Lee · Joseph Liu · Yizhak Ben-Shabat · Victor Zordan · Mubbasir Kapadia

We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#317
I2V3D: Controllable Image-to-video Generation with 3D Guidance

Zhiyuan Zhang · Dongdong Chen · Jing Liao

We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Wed 22 Oct. 14:15 - 16:15 PDT

#318
Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing

Seungjin Jung · Kanghee Lee · Yonghyun Jeong · Haeun Noh · Jungmin Lee · Jongwon Choi

Domain Generalizable Face Anti-Spoofing (DG-FAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DG-FAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
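
The orthogonal-decomposition step can be illustrated with a plain Gram-Schmidt projection that removes the span of a set of domain-specific directions from each feature, as below; how those directions are obtained and the losses that enforce the decomposition are omitted here and differ in the actual method.

```python
import torch

def gram_schmidt(vectors: torch.Tensor) -> torch.Tensor:
    """vectors: (K, D) -> orthonormal basis (K, D)."""
    basis = []
    for v in vectors:
        for b in basis:
            v = v - (v @ b) * b        # remove components along earlier directions
        basis.append(v / (v.norm() + 1e-8))
    return torch.stack(basis)

def decompose(features: torch.Tensor, domain_vectors: torch.Tensor):
    """features: (N, D). Returns (domain_invariant, domain_specific) parts."""
    Q = gram_schmidt(domain_vectors)    # (K, D) orthonormal domain-specific basis
    coeffs = features @ Q.t()           # (N, K) projections onto that basis
    domain_specific = coeffs @ Q        # component inside the domain-specific span
    return features - domain_specific, domain_specific
```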

Wed 22 Oct. 14:15 - 16:15 PDT

#319
FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

Zhaolun Li · Jichang Li · Yinqi Cai · Junye Chen · Xiaonan Luo · Guanbin Li · Rushi Lan

In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.

Wed 22 Oct. 14:15 - 16:15 PDT

#320
StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors

Xiaokun Sun · Zeyu Cai · Ying Tai · Jian Yang · Zhenyu Zhang

While haircuts convey distinct personality, existing avatar generation methods fail to model practical hair due to data limitations or entangled representations. We propose StrandHead, a novel text-driven method capable of generating 3D hair strands and disentangled head avatars with strand-level attributes. Instead of using large-scale hair-text paired data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative models pre-trained on human mesh data. To this end, we propose a meshing approach guided by strand geometry to guarantee the gradient flow from the distillation objective to the neural strand representation. The optimization is then regularized by statistically significant haircut features, leading to stable updating of strands against unreasonable drifting. These employed 2D/3D human-centric priors contribute to text-aligned and realistic 3D strand generation. Extensive experiments show that StrandHead achieves state-of-the-art performance on text-to-strand generation and disentangled 3D head avatar modeling. The generated 3D hair can be applied to avatars for strand-level editing, as well as implemented in graphics engines for physical simulation and other applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#321
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li · Chuanxia Zheng · Christian Rupprecht · Andrea Vedaldi

We present Puppet-Master, a video generator designed to capture the internal, part-level motion dynamics of objects as a proxy to understand object dynamics universally. Given an image of an object and a set of “drags” specifying the trajectory of a few points of the object, Puppet-Master synthesizes a video where the object parts move accordingly. We extend a pre-trained image-to-video generator with a module that encodes the input drags, and introduce all-to-first attention, a novel alternative to conventional spatial attention that mitigates artifacts caused by fine-tuning a video generator on out-of-domain data. Instead of using real videos, which often intertwine part-level motion with overall object motion, camera movement, and occlusion, we fine-tune Puppet-Master on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips obtained by rendering synthetic 3D animations. We extensively filter out sub-optimal animations and augment the synthetic renderings with meaningful drags to emphasize the internal dynamics of objects. We demonstrate that by using this synthetic dataset, Puppet-Master learns to generate part-level motions, unlike other motion-conditioned video generators that mostly move the object as a whole, and generalizes well to real images, outperforming existing methods on real-world benchmarks in a zero-shot manner.

Wed 22 Oct. 14:15 - 16:15 PDT

#322
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Hao He · Ceyuan Yang · Shanchuan Lin · Yinghao Xu · Meng Wei · Liangke Gui · Qi Zhao · Gordon Wetzstein · Lu Jiang · Hongsheng Li

This paper introduces CameraCtrl II, a framework that enables continuous and dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera motion. We take an approach that progressively expands the generation of dynamic scenes: first enhancing dynamic content within individual clips, then extending these capabilities to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera annotation for training, while designing a lightweight camera injection module and training scheme to enhance dynamics from pretrained models. Building on these improved single-clip capabilities, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables dynamic scene synthesis with substantially wider spatial exploration and enhanced dynamics than previous approaches. We will release the dataset and code.

Wed 22 Oct. 14:15 - 16:15 PDT

#323
General Compression Framework for Efficient Transformer Object Tracking

Lingyi Hong · Jinglun Li · Xinyu Zhou · Shilin Yan · Pinxue Guo · Kaixun Jiang · Zhaoyu Chen · Shuyong Gao · Runze Li · Xingdong Sheng · Wei Zhang · Hong Lu · Wenqiang Zhang

Previous works have attempted to improve tracking efficiency through lightweight architecture design or knowledge distillation from teacher models to compact student trackers. However, these solutions often sacrifice accuracy for speed to a great extent, and also suffer from complex training processes and structural limitations. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages to break the limitation of model structure. Additionally, we design a unique replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior and simplifies the training process. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of our CompressTracker. Our CompressTracker-SUTrack, compressed from SUTrack, retains about 99% performance on LaSOT ($\mathbf{72.2\%}$ AUC) while achieving a $\mathbf{2.42\times}$ speedup.

Wed 22 Oct. 14:15 - 16:15 PDT

#324
DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors

Runqi Wang · Yang Chen · Sijie Xu · Tianyao He · Wei Zhu · Dejia Song · Nemo Chen · Xu Tang · Yao Hu

Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion models and plug-and-play adaptive attention layers for image and video face swapping. First, we introduce four fine-grained facial conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Our framework seamlessly adapts to both image and video domains. Our code and results will be available on the project page: https://dynamic-face.github.io/.

Wed 22 Oct. 14:15 - 16:15 PDT

#325
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Xin Ding · Hao Wu · Yifan Yang · Shiqi Jiang · Qianxi Zhang · Donglin Bai · Zhibo Chen · Ting Cao

With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-high-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To resolve the key contradiction between linear video streaming speed and the quadratic computation cost of transformers, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, the LLM is only invoked when relevant events occur. To realize event feature extraction at constant cost, we propose an Event-Preserving Feature Extractor (EPFE) based on a state-space method, generating a single perception token for spatiotemporal features. These techniques equip the video LLM with full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI Copilot and interactive media.
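
At a high level, event-gated LLM invocation amounts to the loop sketched below: every frame is reduced to a single perception token at constant cost, and the LLM is called only when the Cognition Gate fires. All interfaces (`video_encoder`, `epfe`, `cognition_gate`, `llm`) are hypothetical placeholders, not the released API.

```python
def stream_loop(frames, video_encoder, epfe, cognition_gate, llm,
                gate_threshold: float = 0.5):
    """Illustrative perception-cognition interleaving over a frame stream."""
    responses = []
    state = None                                    # recurrent state of the state-space extractor
    for frame in frames:
        feat = video_encoder(frame)
        token, state = epfe(feat, state)            # one perception token per frame
        if cognition_gate(token) > gate_threshold:  # event detected -> invoke the LLM
            responses.append(llm(token))
        # otherwise: perception continues at full frame rate with no LLM cost
    return responses
```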

Wed 22 Oct. 14:15 - 16:15 PDT

#326
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Zhefei Gong · Pengxiang Ding · Shangke Lyu · Siteng Huang · Mingyang Sun · Wei Zhao · Zhaoxin Fan · Donglin Wang

In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10× faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#327
Unlocking the Potential of Diffusion Priors in Blind Face Restoration

Yunqi Miao · Zhiyu Qu · Mingqi Gao · Changrui Chen · Jifei Song · Jungong Han · Jiankang Deng

Although the diffusion prior is rising as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with little or no degradation, while BFR handles moderately to severely degraded images. Additionally, the LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate the complex and unknown degradations in real-world scenarios. In this work, we use a unified network, FLIPNET, that switches between two modes to address these specific gaps. In restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In degradation mode, the model synthesizes real-world-like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion-prior-based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling real-world degradations.

Wed 22 Oct. 14:15 - 16:15 PDT

#328
Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Yi-Ting Chen · Ting-Hsuan Liao · Pengsheng Guo · Alex Schwing · Jia-Bin Huang

We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit unifying 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling while maintaining structural consistency in 3D reconstructions. Code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#329
A₀ : An Affordance-Aware Hierarchical Model for General Robotic Manipulation

Rongtao Xu · Jian Zhang · Minghao Guo · Youpeng Wen · Haoting Yang · Min Lin · Jianzheng Huang · Zhe Li · Kaidong Zhang · Liqiong Wang · Yuxuan Kuang · Meng Cao · Feng Zheng · Xiaodan Liang

Robotic manipulation faces critical challenges in understanding spatial affordances—the "where" and "how" of object interactions—essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A₀, a hierarchical affordance-aware diffusion model that decomposes manipulation task into high-level spatial affordance understanding and low-level action execution. A₀ leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact point and post-contact trajectories. A₀ is pre-trained on 1 million contact points data and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model’s output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman and Dobot) demonstrate A₀'s superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.

Wed 22 Oct. 14:15 - 16:15 PDT

#330
Highlight
DisenQ: Disentangling Q-Former for Activity-Biometrics

Shehreen Azad · Yogesh Rawat

In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometric feature learning more complex. While additional visual data like pose and/or silhouettes help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.

Wed 22 Oct. 14:15 - 16:15 PDT

#331
The Source Image is the Best Attention for Infrared and Visible Image Fusion

Song Wang · Xie Han · Liqun Kuang · Boying Wang · Zhongyu Chen · Zherui Qiao · Fan Yang · Xiaoxia Liu · Bingyu Zhang · Zhixun Wang

Infrared and visible image fusion (IVF) aims to generate informative fused images by combining the merits of different modalities. In this paper, we uncover the inherent "attention properties" of infrared images, which directly arise from their physical characteristics and can be linked to attention mechanisms naturally, as observed in the gradient-weighted class activation mapping (Grad-CAM) visualization results of image classification models. To incorporate this property into IVF for better fusion, we propose the source infrared cross attention (I-SCA). Furthermore, we extend this discovery to visible images and introduce the source visible cross attention (V-SCA). The joint use of I-SCA and V-SCA addresses longstanding issues in image fusion, such as insufficient and incomplete multimodal feature interaction and fusion. Moreover, to solve the problem of mismatched channel numbers between the source images and intermediate features, which makes it impossible to apply the attention equation directly, and to minimize the domain gap between their respective feature spaces, an adaptive channel boosting and intelligent space mapping module (CBSM) is introduced. Specifically, we treat the CBSM-processed raw image as the query, while the intermediate features of another modality are treated as keys and values in I-SCA and V-SCA. Unlike attention mechanisms that divide images into patches or limit computations to local windows, we achieve smoother and more robust IVF through true global modeling across the entire image space in the source image attention, with linear complexity. Comparison with current SOTA methods on three popular public datasets confirms the superiority of our method.
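
A rough sketch of source-image cross attention with linear complexity is shown below, where the (channel-boosted) source image supplies the queries and intermediate features of the other modality supply the keys and values; the kernelized (ELU+1) linear-attention trick is a stand-in assumption for how global modeling at linear cost could be realized, and may differ from the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SourceCrossAttention(nn.Module):
    """Illustrative linear-complexity cross attention with the source image as query."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, source_tokens: torch.Tensor, feat_tokens: torch.Tensor):
        """source_tokens: (B, N, D) from the raw image; feat_tokens: (B, M, D)."""
        q = F.elu(self.q(source_tokens)) + 1          # positive kernel feature maps
        k = F.elu(self.k(feat_tokens)) + 1
        v = self.v(feat_tokens)
        kv = torch.einsum('bmd,bme->bde', k, v)       # (B, D, D) global summary
        norm = torch.einsum('bnd,bd->bn', q, k.sum(dim=1)).clamp(min=1e-6)
        return torch.einsum('bnd,bde->bne', q, kv) / norm.unsqueeze(-1)
```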

Wed 22 Oct. 14:15 - 16:15 PDT

#332
HUST: High-Fidelity Unbiased Skin Tone Estimation via Texture Quantization

Zimin Ran · Xingyu Ren · Xiang An · Kaicheng Yang · Ziyong Feng · Jing Yang · Rolandos Alexandros Potamias · Linchao Zhu · Jiankang Deng

Recent 3D facial reconstruction methods have made significant progress in shape estimation, but high-fidelity unbiased facial albedo estimation remains challenging. Existing methods rely on expensive light-stage captured data, and while they have made progress in either high-fidelity reconstruction or unbiased skin tone estimation, no work has yet achieved optimal results in both aspects simultaneously. In this paper, we present a novel high-fidelity unbiased facial diffuse albedo reconstruction method, HUST, which recovers the diffuse albedo map directly from a single image without the need for captured data. Our key insight is that the albedo map is the illumination-invariant texture map, which enables us to use inexpensive texture data for diffuse albedo estimation by eliminating illumination. To achieve this, we collect large-scale high-resolution facial images and train a VQGAN model in the image space. To adapt the pre-trained VQGAN model for UV texture generation, we fine-tune the encoder by using limited UV textures and our high-resolution faces under adversarial supervision in both image and latent space. Finally, we train a cross-attention module and utilize group identity loss for the domain adaptation from texture to albedo. Extensive experiments demonstrate that HUST can predict high-fidelity facial albedos for in-the-wild images. On the FAIR benchmark, HUST achieves the lowest average ITA error (11.20) and bias score (1.58), demonstrating superior accuracy and robust fairness across the entire spectrum of human skin tones. Our code, models, and training data will be made publicly available to facilitate future research.

Wed 22 Oct. 14:15 - 16:15 PDT

#333
AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Yangyi Huang · Ye Yuan · Xueting Li · Jan Kautz · Umar Iqbal

Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.

Wed 22 Oct. 14:15 - 16:15 PDT

#334
Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Pulkit Kumar · Shuaiyi Huang · Matthew Walmer · Sai Saketh Rambhatla · Abhinav Shrivastava

Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym.
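
For readers unfamiliar with the descriptor, a minimal Histogram of Oriented Displacements over a single tracked point could look like the sketch below; the bin count and magnitude weighting are illustrative choices, not the paper's exact specification.

```python
# Illustrative HoD sketch for one trajectory; binning details are assumptions.
import numpy as np

def hod(trajectory: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """trajectory: (T, 2) array of (x, y) point locations over T frames."""
    disp = np.diff(trajectory, axis=0)               # frame-to-frame displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])      # orientations in [-pi, pi]
    mags = np.linalg.norm(disp, axis=1)              # displacement magnitudes
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    hist, _ = np.histogram(angles, bins=bins, weights=mags)
    return hist / (hist.sum() + 1e-8)                # normalized descriptor

# Example: a roughly circular trajectory spreads its mass over all orientation bins.
t = np.linspace(0, 2 * np.pi, 32)
print(hod(np.stack([np.cos(t), np.sin(t)], axis=1)))
```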

Wed 22 Oct. 14:15 - 16:15 PDT

#335
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

zijie wu · Chaohui Yu · Fan Wang · Xiang Bai

Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.

Wed 22 Oct. 14:15 - 16:15 PDT

#336
Auto-Regressive Transformation for Image Alignment

Kanggeon Lee · Soochahn Lee · Kyoung Mu Lee

Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.

Wed 22 Oct. 14:15 - 16:15 PDT

#337
Controllable Weather Synthesis and Removal with Video Diffusion Models

Chih-Hao Lin · Zian Wang · Ruofan Liang · Yuxuan Zhang · Sanja Fidler · Shenlong Wang · Zan Gojcic

Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects---including rain, snow, fog, and clouds---directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.

Wed 22 Oct. 14:15 - 16:15 PDT

#338
Sequential Gaussian Avatars with Hierarchical Motion Context

Wangze Xu · Yifan Zhan · Zhihang Zhong · Xiao Sun

The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, with the recently popular 3DGS technique enabling real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex mapping from pose to appearance during fitting. In this paper, we exploit the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of the proposed motion conditions, we adopt a spatio-temporal multi-scale sampling strategy to hierarchically integrate more motion clues when modeling human avatars. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, while delivering comparable or even superior performance.

Wed 22 Oct. 14:15 - 16:15 PDT

#339
TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation

Yinda Chen · Haoyuan Shi · Xiaoyu Liu · Te Shi · Ruobing Zhang · Dong Liu · Zhiwei Xiong · Feng Wu

Neuron segmentation from electron microscopy (EM) volumes is crucial for understanding brain circuits, yet the complex neuronal structures in high-resolution EM images present significant challenges. Inspired by autoregressive pretraining in language models, we propose TokenUnify, a hierarchical predictive coding framework that captures multi-scale dependencies through complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space with emergent properties. From an information-theoretic perspective, these three tasks are complementary and provide optimal coverage of visual data structure. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity. Leveraging the Mamba architecture's linear-time sequence modeling capabilities, TokenUnify achieves a 45\% performance improvement on downstream neuron segmentation and outperforms MAE by 21\%. Our approach demonstrates superior scaling properties as model size increases, effectively bridging the gap between pretraining strategies for language and vision models.

Wed 22 Oct. 14:15 - 16:15 PDT

#340
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Shuangrui Ding · Rui Qian · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Yuwei Guo · Dahua Lin · Jiaqi Wang

The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the "error accumulation" problem, where an erroneous or missed mask cascades and influences the segmentation of subsequent frames, which limits the performance of SAM 2 on complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects in complex long-term videos. Without introducing any additional parameters or further training, SAM2Long significantly and consistently outperforms SAM 2 on nine VOS benchmarks and three VOT benchmarks. Notably, SAM2Long achieves an average improvement of 3.7 points across all 12 direct comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS.
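
The pathway-selection procedure reads like a beam search over per-frame mask hypotheses; a toy sketch under that reading follows, where propose_masks is a hypothetical stand-in for SAM 2's per-frame mask proposals and confidences.

```python
# Toy beam-search-style sketch of constrained pathway selection; not SAM2Long's code.
from dataclasses import dataclass, field

@dataclass
class Pathway:
    score: float = 0.0                      # cumulative confidence of this pathway
    masks: list = field(default_factory=list)

def segment_video(frames, propose_masks, num_pathways: int = 3):
    pathways = [Pathway()]
    for frame in frames:
        candidates = []
        for p in pathways:
            # each pathway proposes several (mask, confidence) hypotheses for this frame
            for mask, conf in propose_masks(frame, p.masks):
                candidates.append(Pathway(p.score + conf, p.masks + [mask]))
        # keep the same fixed number of highest-scoring branches as the new pathways
        pathways = sorted(candidates, key=lambda p: p.score, reverse=True)[:num_pathways]
    # after the final frame, the highest cumulative score wins
    return max(pathways, key=lambda p: p.score).masks
```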

Wed 22 Oct. 14:15 - 16:15 PDT

#341
T2Bs: Text-to-Character Blendshapes via Video Generation

Jiahao Luo · Chaoyang Wang · Michael Vasilkovsky · Vladislav Shakhrai · Di Liu · Peiye Zhuang · Sergey Tulyakov · Peter Wonka · Hsin-Ying Lee · James Davis · Jian Wang

We propose a new framework to create high-quality character head morphable models from text, combining static text-to-3D generation with video diffusion. Bridging the gap between these two methods is challenging: text-to-3D models produce detailed static geometry but cannot synthesize motion, while video diffusion models generate motion but face consistency issues like varying colors, varying viewpoints, or geometric distortion. Our solution uses deformable 3D Gaussian splatting to align static 3D models with video diffusion outputs, enabling the creation of a set of diverse, expressive motions with greater accuracy. By incorporating static geometry as a constraint and using a view-dependent deformation MLP, we reduce video artifacts and produce coherent, consistent results. This approach allows us to build a 3D morphable model that can generate new, realistic expressions. Compared to existing 4D generation techniques, our method achieves superior results and creates expressive character head models that can be animated.

Wed 22 Oct. 14:15 - 16:15 PDT

#342
TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Ruijie Lu · Yixin Chen · Yu Liu · Jiaxiang Tang · Junfeng Ni · Diwen Wan · Gang Zeng · Siyuan Huang

Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from the Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning.

Wed 22 Oct. 14:15 - 16:15 PDT

#343
Unfolding-Associative Encoder-Decoder Network with Progressive Alignment for Pansharpening

Shijie Fang · Hongping Gan

Deep Unfolding Networks (DUNs) have emerged as a powerful framework for pansharpening due to their interpretable fusion strategies. However, existing DUNs are limited by their serial iterative architectures, which hinder cross-stage and cross-modal feature interactions at different abstraction levels. This limitation results in insufficient integration of multi-level multimodal features and compromised reconstruction accuracy. To address these challenges, we propose the Unfolding-Associative Encoder-Decoder Network (UED-Net), an innovative framework that iteratively extracts multi-level cross-modal degradation encodings and recursively refines features for cross-stage adaptive aggregation decoding through lightweight processes. Specifically, we first introduce the spatial-spectral encoding module, which progressively and interpretably perceives the hierarchical degradation encoding features of both space and spectrum. Moreover, we develop the unfolding-associative attention module to capture pixel-level attention across stages, thereby leveraging the causal relationships of multi-level features for aggregation during decoding. Meanwhile, we implement a progressive alignment mechanism, which coordinates both feature distribution and alignment of spatial and spectral modalities between iterative stages to facilitate adaptive fusion. These modules enable UED-Net to achieve efficient pansharpening by aggregating multi-level features. Extensive qualitative and quantitative experiments confirm the superiority of UED-Net.

Wed 22 Oct. 14:15 - 16:15 PDT

#344
AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation

Guanxing Lu · Tengbo Yu · Haoyuan Deng · Season Chen · Yansong Tang · Ziwei Wang

Performing general language-conditioned bimanual manipulation tasks is of great importance for many applications ranging from household service to industrial assembly. However, collecting bimanual manipulation data is expensive due to the high-dimensional action space, which poses challenges for conventional methods to handle general bimanual manipulation tasks. In contrast, unimanual policy has recently demonstrated impressive generalizability across a wide range of tasks because of scaled model parameters and training data, which can provide sharable manipulation knowledge for bimanual systems. To this end, we propose a plug-and-play method named AnyBimanual, which transfers pretrained unimanual policy to general bimanual manipulation policy with few bimanual demonstrations. Specifically, we first introduce a skill manager to dynamically schedule the skill representations discovered from pretrained unimanual policy for bimanual manipulation tasks, which linearly combines skill primitives with task-oriented compensation to represent the bimanual manipulation instruction. To mitigate the observation discrepancy between unimanual and bimanual systems, we present a visual aligner to generate soft masks for the visual embedding of the workspace, aligning the visual input of the unimanual policy model for each arm with what it saw during the pretraining stage. AnyBimanual shows superiority on 12 simulated tasks from RLBench2 with a sizable 17.33% improvement in success rate over previous methods. Experiments on 9 real-world tasks further verify its practicality with an average success rate of 84.62%.

Wed 22 Oct. 14:15 - 16:15 PDT

#345
MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

Tao Wang · Peiwen Xia · Bo Li · Peng-Tao Jiang · Zhe Kong · Kaihao Zhang · Tong Lu · Wenhan Luo

Adverse weather conditions, such as rain, snow, and haze, introduce complex degradations that present substantial challenges for effective image restoration. Existing all-in-one models often rely on fixed network structures, limiting their ability to adapt to the varying characteristics of different weather conditions. Moreover, these models typically lack the iterative refinement process that human experts use for progressive image restoration. In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions. Our method incorporates two core types of experts, i.e., channel-wise modulation and spatial modulation experts to address task-specific degradation characteristics while minimizing task interference. In addition, inspired by human expertise, we frame the optimization process as a sequential, progressive problem, allowing the network to refine its parameters progressively and adapt to specific weather conditions. Extensive experiments demonstrate the efficacy and superiority of our proposed method. The code and pre-trained models will be available.

Wed 22 Oct. 14:15 - 16:15 PDT

#346
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Li Huaqiu · Yong Wang · Tongwen Huang · Hailang Huang · Haoqian Wang · Xiangxiang Chu

Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. The proposed method enables zero-shot unified image restoration without the need for any prior knowledge of specific task types and degradation modeling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#347
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun · Shuo Chen · Fangfu Liu · Zilong Chen · Yueqi Duan · Jun Zhu · Jun Zhang · Yikai Wang

In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to poor spatial and temporal controllability during generation. To overcome this difficulty, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware directors from dimension-variant data. This decoupled video diffusion enables precise manipulation of spatial structures and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames by combining spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation, respectively. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves state-of-the-art performance in decoupled video generation, as well as 3D and 4D scene generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#348
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

Feng yan · Fanfan Liu · Yiyang Huang · ZechaoGuan ZechaoGuan · Liming Zheng · Yufeng Zhong · Chengjian Feng · Lin Ma

In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model, \textit{RoboMM}, along with the comprehensive dataset, \textit{RoboData}. \textit{RoboMM} enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. \textit{RoboData} offers a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions, and the space alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with \textit{RoboData} and the unified physical space, \textit{RoboMM} is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets, rather than being limited to specific data or task selections. Its design significantly enhances robotic manipulation performance, increasing the average sequence length on the CALVIN benchmark from 1.7 to 3.5 and ensuring cross-embodiment capabilities, achieving state-of-the-art results across multiple datasets, including both simulated and real-world data.

Wed 22 Oct. 14:15 - 16:15 PDT

#349
LOMM: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation

Seunghun Lee · Jiwan Seo · Minwoo Choi · Kiljoon Han · Jaehoon Jeong · Zane Durante · Ehsan Adeli · Sang Hyun Park · Sunghoon Im

In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability throughout the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves a state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos.

Wed 22 Oct. 14:15 - 16:15 PDT

#350
Highlight
Video Motion Graphs

Haiyang Liu · Zhan Xu · Fating Hong · Hsin-Ping Huang · Yi Zhou · Yang Zhou

We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectories. Our Video Motion Graphs outperforms existing generative- and retrieval-based methods for human motion video generation. Our codes and pretrained models are publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#351
Online Generic Event Boundary Detection

Hyung Rok Jung · Daneul Kim · Seunggyun Lim · Jeany Son · Jonghyun Choi

Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods rely on complete video frames for prediction, which contrasts with the human ability to process information online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, $\textit{ESTimator}$, inspired by Event Segmentation Theory (EST) which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA), and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD computes the discrepancy between the prediction and the actual incoming frame, adaptively adjusting the error threshold using statistical tests on historical errors to capture diverse and subtle event transitions. Experimental results demonstrate that $ESTimator$ outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
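
A minimal sketch of the online-boundary idea, assuming a simple z-score test over a sliding window of recent prediction errors in place of the paper's statistical tests:

```python
# Hedged sketch of an online boundary discriminator; the window size and z-score test
# are simplifying assumptions, not the OBD's actual formulation.
from collections import deque
import numpy as np

class OnlineBoundaryDetector:
    def __init__(self, window: int = 30, z_thresh: float = 2.5):
        self.errors = deque(maxlen=window)   # history of prediction errors
        self.z_thresh = z_thresh

    def step(self, predicted_frame: np.ndarray, actual_frame: np.ndarray) -> bool:
        err = float(np.mean((predicted_frame - actual_frame) ** 2))
        is_boundary = False
        if len(self.errors) >= 5:
            mu, sigma = np.mean(self.errors), np.std(self.errors) + 1e-8
            is_boundary = (err - mu) / sigma > self.z_thresh   # adaptive threshold
        self.errors.append(err)
        return is_boundary
```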

Wed 22 Oct. 14:15 - 16:15 PDT

#352

Language-conditioned robot manipulation in the continuous spectrum presents a persistent challenge due to the difficulty of mapping states to target actions. Previous methods face limitations in effectively modeling object states, primarily due to their reliance on executing ambiguous instructions devoid of explicit state information. In response, we present SD$^2$Actor, a zero-shot robotic manipulation framework that possesses the capability to generate precise actions in continuous states. Specifically, given novel instructions, we aim to generate instruction-following and accurate robot manipulation actions. Instead of time-consuming optimization and finetuning, our zero-shot method generalizes to any object state with a wide range of translations and versatile rotations. At its core, we quantify multiple base states in the training set and utilize their combination to refine the target action generated by the diffusion model. To obtain novel state representations, we initially employ LLMs to extract the novel state from the instruction and decompose it into multiple learned base states. We then employ the linear combination of base state embeddings to produce novel state features. Moreover, we introduce the orthogonalization loss to constrain the state embedding space, which ensures the validity of linear interpolation. Experiments demonstrate that SD$^2$Actor outperforms state-of-the-art methods across a diverse range of manipulation tasks on the ARNOLD benchmark. Moreover, SD$^2$Actor can effectively learn generalizable policies from a limited number of human demonstrations, achieving promising accuracy in a variety of real-world manipulation tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#353
SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

Xiangyue Zhang · Jianfang Li · Jiaxu Zhang · Ziqiang Dang · Jianqiang Ren · Liefeng Bo · Zhigang Tu

A good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn base motions and sparse motions, and then adaptively fuse them. In particular, a coarse2fine cross-attention module and rhythmic consistency learning are explored to establish the rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.

Wed 22 Oct. 14:15 - 16:15 PDT

#354

In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, “Shuffle and Warp”, which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting the relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset, where it demonstrates a significant boost over state-of-the-art alternatives in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
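
A hedged sketch of what a "Shuffle and Warp" style augmentation could look like for trimmed skeleton clips; the linear time-warp, warp range, and frame-level labels below are illustrative choices rather than the paper's specification.

```python
# Sketch: time-warp trimmed single-action clips and concatenate them in a random order
# to build a multi-action permutation. Details are assumptions, not DuoCLR's exact recipe.
import numpy as np

def shuffle_and_warp(clips, rng: np.random.Generator, warp_range=(0.8, 1.2)):
    """clips: list of (T_i, J, 3) skeleton sequences, one action each."""
    order = rng.permutation(len(clips))
    warped, labels = [], []
    for idx in order:
        clip = clips[idx]
        new_len = max(2, int(len(clip) * rng.uniform(*warp_range)))   # temporal warp
        src = np.linspace(0, len(clip) - 1, new_len)
        lo = np.floor(src).astype(int)
        hi = np.minimum(lo + 1, len(clip) - 1)
        w = (src - lo)[:, None, None]
        warped.append((1 - w) * clip[lo] + w * clip[hi])              # linear interpolation
        labels.append(np.full(new_len, idx))                          # frame-level source labels
    return np.concatenate(warped), np.concatenate(labels)
```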

Wed 22 Oct. 14:15 - 16:15 PDT

#355
VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction

Martin de La Gorce · Charlie Hewitt · Tibor Takács · Robert Gerdisch · Zafiirah Hosenie · Givi Meishvili · Marek Kowalski · Thomas J. Cashman · Antonio Criminisi

Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.

Wed 22 Oct. 14:15 - 16:15 PDT

#356
EVDM: Event-based Real-world Video Deblurring with Mamba

Zhijing Sun · Senyan Xu · Kean Liu · Runze Tian · Xueyang Fu · Zheng-Jun Zha

Existing event-based video deblurring methods face limitations in extracting and fusing long-range spatiotemporal motion information from events, primarily due to restricted receptive fields or low computational efficiency, resulting in suboptimal deblurring performance. To address these issues, we introduce the state space model, which leverages linear complexity and global receptive fields for long-range modeling, and propose EVDM, a novel Event-based Video Deblurring framework with Mamba. The framework consists of: (1) Motion Clue Extraction Mamba (MCEM), which employs an event self-reconstruction loss to ensure the completeness of details when extracting long-range motion information. (2) Motion-aware Intra-frame Fusion Mamba (MIFM) and Inter-frame Temporal Propagation Mamba (ITPM), which utilize the motion-aware state space to perform cross-modal fusion and inter-frame information exchange guided by motion clues. Consequently, EVDM achieves superior detail restoration in blurred regions while ensuring temporal motion consistency across frames. Additionally, to overcome the limitation of fixed exposure ratios in existing event-frame paired datasets, we introduce T-RED, a high-quality, high-resolution dataset with varying exposure time ratios. T-RED provides more realistic and complex data for event-based video deblurring research. Experiments on multiple datasets demonstrate that EVDM outperforms previous SOTA methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#357
From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Sen Wang · Shao Zeng · Tianjun Gu · zhizhong zhang · Ruixin Zhang · Shouhong Ding · Jingyun Zhang · Jun Wang · Xin TAN · Yuan Xie · Lizhuang Ma

Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

Wed 22 Oct. 14:15 - 16:15 PDT

#358
Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer

Yuansheng Li · Yunhao Zou · Linwei Chen · Ying Fu

Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement: 1) the lack of training datasets, and 2) the difficulty of eliminating IHI-specific degradation components with learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method.

Wed 22 Oct. 14:15 - 16:15 PDT

#359
Intra-modal and Cross-modal Synchronization for Audio-visual Deepfake Detection and Temporal Localization

Ashutosh Anshul · Shreyas Gopal · Deepu Rajan · Eng Chng

Recent deepfake detection algorithms focus solely on uni-modal or cross-modal inconsistencies. While the former disregards audio-visual correspondence entirely, rendering it less effective against multimodal attacks, the latter overlooks inconsistencies within a particular modality. Moreover, many models are single-stage supervised frameworks, effective on specific training data but less generalizable to new manipulations. To address these gaps, we propose a two-stage multimodal framework that first learns intra-modal and cross-modal temporal synchronization on real videos, capturing audio-visual correspondences crucial for deepfake detection and localization. We introduce a Gaussian-targeted loss in our pretraining model to focus on learning relative synchronization patterns across multimodal pairs. Using pretrained features, our approach not only enables classification on fully manipulated videos but also supports a localization module for partial deepfakes with only specific segments spoofed. Moreover, the pretraining stage does not require fine-tuning, thus reducing complexity. Our model, tested on various benchmark datasets, demonstrates strong generalization and precise temporal localization.
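
The Gaussian-targeted loss can be read as replacing a one-hot synchronization target with a soft Gaussian over candidate offsets; a small sketch under that reading follows, where the offset discretization and sigma are assumptions.

```python
# Hedged sketch of a Gaussian-targeted offset loss; not the paper's exact objective.
import torch
import torch.nn.functional as F

def gaussian_target_loss(offset_logits: torch.Tensor, true_offset: torch.Tensor,
                         sigma: float = 1.0) -> torch.Tensor:
    # offset_logits: (B, K) scores over K candidate audio-visual temporal offsets
    # true_offset:   (B,) index of the ground-truth offset
    K = offset_logits.shape[1]
    grid = torch.arange(K, device=offset_logits.device).float()
    target = torch.exp(-0.5 * ((grid[None] - true_offset[:, None].float()) / sigma) ** 2)
    target = target / target.sum(dim=1, keepdim=True)        # soft Gaussian target
    return F.kl_div(F.log_softmax(offset_logits, dim=1), target, reduction="batchmean")
```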

Wed 22 Oct. 14:15 - 16:15 PDT

#360
FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing

Bizhu Wu · Jinheng Xie · Meidan Ding · Zhe Kong · Jianfeng Ren · Ruibin Bai · Rong Qu · Linlin Shen

Generating realistic human motions from given textual descriptions has undergone significant advancements owing to the prevalence of digital humans. Although recent studies have achieved notable success in this task, they omitted specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442k human motion snippets, short segments of the human motion sequences, and their corresponding detailed human body part movement descriptions. Additionally, the dataset includes about 95k detailed paragraphs describing the movements of human body parts throughout entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3\% improvement in Top-3 accuracy for the MDM network. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. The dataset and code will be released on GitHub.

Wed 22 Oct. 14:15 - 16:15 PDT

#361
Highlight
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

gaojie lin · Jianwen Jiang · Jiaqi Yang · Zerong Zheng · Chao Liang · ZHANG YUAN · Jingtu Li

End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up in the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals).

Wed 22 Oct. 14:15 - 16:15 PDT

#362
Context-Aware Academic Emotion Dataset and Benchmark

Luming Zhao · Jingwen Xuan · Jiamin Lou · Yonghui Yu · Wenwu Yang

Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues—such as whether a student is looking at a phone or reading a book—alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions.

Wed 22 Oct. 14:15 - 16:15 PDT

#363
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo · Zeyu HU · Na Zhao · De Wen Soh

Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: Anonymous Project Website.
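
Since MotionLab learns the source-to-target mapping with rectified flows, a standard rectified-flow training step is sketched below for context; the model signature and tensor shapes are hypothetical.

```python
# Generic rectified-flow step (predict the constant velocity along a straight path);
# shown only as background for the source-to-target mapping described above.
import torch

def rectified_flow_loss(model, source_motion, target_motion, condition):
    # source_motion, target_motion: (B, T, D) motion tensors; condition: task-specific input
    t = torch.rand(source_motion.shape[0], 1, 1, device=source_motion.device)
    x_t = (1 - t) * source_motion + t * target_motion      # linear interpolation at time t
    velocity = target_motion - source_motion               # ground-truth constant velocity
    pred = model(x_t, t.view(-1), condition)               # model predicts the velocity field
    return torch.mean((pred - velocity) ** 2)
```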

Wed 22 Oct. 14:15 - 16:15 PDT

#364
Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion

Jiawei Liang · Siyuan Liang · Tianrui Lou · Ming Zhang · liwenjin liwenjin · Dunqiu fan · Xiaochun Cao

Object detection is widely used in real-world applications such as autonomous driving, yet adversarial camouflage poses a significant threat by deceiving detectors from multiple viewpoints. Existing techniques struggle to maintain consistent attack efficacy across different viewpoints. To address this, we propose GRAC, an adversarial camouflage framework that enhances attack effectiveness across viewpoints and distances. First, we identify conflicts in gradient updates across angles and introduce gradient reweighting to resolve them, enabling coordinated optimization. Second, we model light interactions to simulate illumination changes, improving robustness under varying lighting conditions. Additionally, we address non-uniform texture updates arising from inconsistent sampling density during rendering by applying pooling-based texture regularization to improve smoothness. Extensive experiments in both simulated and physical environments demonstrate that GRAC outperforms existing methods across diverse conditions.
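
The gradient-reweighting step is described only at a high level above; the sketch below resolves conflicting per-viewpoint gradients with a PCGrad-style projection as a stand-in, which may well differ from GRAC's actual rule.

```python
# Surrogate sketch: remove the conflicting component between per-viewpoint gradients
# before averaging them into one texture update. Not GRAC's exact reweighting scheme.
import torch

def reweight_and_merge(grads):
    """grads: list of flattened (1-D) texture gradients, one per rendered viewpoint."""
    merged = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, other)
            if dot < 0:                                             # conflicting directions
                g = g - dot / (other.norm() ** 2 + 1e-12) * other   # project out the conflict
        merged.append(g)
    return torch.stack(merged).mean(dim=0)                          # coordinated update
```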

Wed 22 Oct. 14:15 - 16:15 PDT

#365
iManip: Skill-Incremental Learning for Robotic Manipulation

Zexin Zheng · Jia-Feng Cai · Xiao-Ming Wu · Yilin Wei · Yu-Ming Tang · Wei-Shi Zheng · Ancong Wu

The development of a generalist agent with multiple adaptive manipulation skills has been a long-standing goal in the robotics community. In this paper, we explore a crucial task, skill-incremental learning, in robotic manipulation, which is to endow robots with the ability to learn new manipulation skills based on previously learned knowledge without re-training. First, we build a skill-incremental environment based on the RLBench benchmark and explore how traditional incremental methods perform in this setting. We find that they suffer from severe catastrophic forgetting because these classification-oriented methods overlook the temporality and action complexity characteristic of robotic manipulation tasks. To this end, we propose an incremental manipulation framework, termed iManip, to mitigate the above issues. We first design a temporal replay strategy to maintain the integrity of old skills when learning a new skill. Moreover, we propose the extendable PerceiverIO, consisting of an action prompt with extendable weights to adapt to the action primitives of new skills. Extensive experiments show that our framework performs well in skill-incremental learning. Code for the skill-incremental environment and our framework will be open-sourced.

Wed 22 Oct. 14:15 - 16:15 PDT

#366
Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization

Lanning Zhang · Ying Zhou · Fei Gao · Ziyun Li · Maoying Qiao · Jinlan Xu · Nannan Wang

Although deep neural networks have achieved remarkable success in various computer vision tasks, they face significant challenges in degraded image understanding due to domain shifts caused by quality variations. Drawing biological inspiration from the human visual system (HVS), which dynamically adjusts perception strategies through contrast gain control and selective attention to salient regions, we propose Quality-Adaptive Normalization (Q-Norm) - a novel normalization method that learns adaptive parameters guided by image quality features. Our approach addresses two critical limitations of conventional normalization techniques: 1) Domain Covariance Shift: Existing methods fail to align feature distributions across different quality domains. Q-Norm implicitly achieves cross-domain alignment through quality-aware parameter adaptation without explicit loss functions. 2) Biological Plausibility: By mimicking HVS's contrast normalization mechanisms and attention-based feature selection, Q-Norm dynamically adjusts the mean and variance parameters using a pre-trained quality assessment model, ensuring robustness to image degradation. Extensive experiments across multiple tasks (image classification, semantic segmentation, object detection) demonstrate that Q-Norm consistently outperforms baseline methods on low-quality images. Code will be made available after peer review.
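
One plausible reading of Q-Norm is a normalization layer whose affine parameters are generated from a quality descriptor (e.g., produced by a frozen IQA backbone); the sketch below follows that reading, with the two-layer MLP and the use of instance statistics as assumptions.

```python
# Hedged sketch of a quality-adaptive normalization layer; not the paper's definition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityAdaptiveNorm(nn.Module):
    def __init__(self, channels: int, quality_dim: int):
        super().__init__()
        self.to_affine = nn.Sequential(
            nn.Linear(quality_dim, channels), nn.ReLU(),
            nn.Linear(channels, 2 * channels),            # -> per-channel (gamma, beta)
        )

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; q: (B, quality_dim) image-quality descriptor
        x = F.instance_norm(x)                            # strip instance statistics
        gamma, beta = self.to_affine(q).chunk(2, dim=1)   # quality-conditioned affine
        return x * (1 + gamma[..., None, None]) + beta[..., None, None]
```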

Wed 22 Oct. 14:15 - 16:15 PDT

#367
Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction

Yanwen Fang · Wenqi Jia · Xu Cao · Peng-Tao Jiang · Guodong Li · Jintai CHEN

Multi-person motion prediction becomes particularly challenging when handling highly interactive scenarios involving extreme motions. Previous works focused more on the case of `moderate' motions (e.g., walking together), where predicting each pose in isolation often yields reasonable results. However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce the Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. PGformer incorporates a novel cross-query attention module to learn bidirectional dependencies between pose sequences and a proxy unit that subtly controls bidirectional spatial information flow. We evaluate PGformer on the challenging ExPI dataset, which involves large collaborative movements. Both quantitative and qualitative results demonstrate the superiority of PGformer in both short- and long-term predictions. We also test the proposed method on the moderate-movement datasets CMU-Mocap and MuPoTS-3D, generalizing PGformer to scenarios with more than two individuals with promising results.

Wed 22 Oct. 14:15 - 16:15 PDT

#368
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

Yiwen Chen · Yikai Wang · Yihao Luo · Zhengyi Wang · Zilong Chen · Jun Zhu · Chi Zhang · Guosheng Lin

Meshes are the de facto 3D representation in the industry but are labor-intensive to produce. Recently, a line of research has focused on autoregressively generating meshes. This approach processes meshes into a sequence composed of vertices and then generates them vertex by vertex, similar to how a language model generates text. These methods have achieved some success but still struggle to generate complex meshes. One primary reason for this limitation is their inefficient tokenization methods. To address this issue, we introduce MeshAnything V2, an advanced mesh generation model designed to create Artist-Created Meshes that align precisely with specified shapes. A key innovation behind MeshAnything V2 is our novel Adjacent Mesh Tokenization (AMT) method. Unlike traditional approaches that represent each face using three vertices, AMT optimizes this by employing a single vertex wherever feasible, effectively reducing the token sequence length by about half on average. This not only streamlines the tokenization process but also results in more compact and well-structured sequences, enhancing the efficiency of mesh generation. With these improvements, MeshAnything V2 effectively doubles the face limit compared to previous models, delivering superior performance without increasing computational costs. Our extensive experiments across various mesh tokenization methods demonstrate that AMT is pivotal in achieving optimal results in both efficiency and performance.
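
A greedy toy version of adjacent-style tokenization is sketched below: when consecutive faces share an edge, only the new vertex index is emitted. The face ordering and the restart token are simplifications, not AMT's actual definition.

```python
# Toy sketch of adjacency-based face tokenization; SEP is a hypothetical restart token.
SEP = -1

def tokenize_faces(faces):
    """faces: list of (v0, v1, v2) vertex-index triples, assumed roughly adjacency-ordered."""
    tokens, prev = [], None
    for face in faces:
        if prev is not None and len(set(face) & set(prev)) == 2:
            # shared edge with the previous face: emit only the new vertex
            tokens.append(next(v for v in face if v not in prev))
        else:
            tokens.extend([SEP, *face])                   # restart: emit all three vertices
        prev = face
    return tokens

# Two faces sharing the edge (1, 2): the second face costs a single token.
print(tokenize_faces([(0, 1, 2), (1, 2, 3)]))             # [-1, 0, 1, 2, 3]
```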

Wed 22 Oct. 14:15 - 16:15 PDT

#369
MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

Prerit Gupta · Jason Alexander Fotso-Puepi · Zhengyuan Li · Jay Mehta · Aniket Bera

We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and detailed with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance synthesis. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where given music and a textual prompt, both the leader and follower dance motion are generated; (2) Text-to-Dance Accompaniment, where given music, textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner.

Wed 22 Oct. 14:15 - 16:15 PDT

#370
π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?

Susan Liang · Chao Huang · Yolo Yunlong Tang · Zeliang Zhang · Chenliang Xu

The Audio-Visual Acoustic Synthesis (AVAS) task aims to model realistic audio propagation behavior within a specific visual scene. Prior works often rely on sparse image representations to guide acoustic synthesis. However, we argue that this approach is insufficient to capture the intricate physical properties of the environment and may struggle with generalization across diverse scenes. In this work, we review the limitations of existing pipelines and address the research question: Can we leverage physical audio-visual associations to enhance neural acoustic synthesis? We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or $\pi$-AVAS), a novel framework designed with two key objectives. i) Generalization: We develop a vision-guided audio simulation framework that leverages physics-based sound propagation. By explicitly modeling vision-grounded geometry and sound rays, our approach achieves robust performance across diverse visual environments. ii) Realism: While simulation-based approaches offer generalizability, they often compromise on realism. To mitigate this, we incorporate a second stage for data-centric refinement, where we propose a flow matching-based audio refinement model to narrow the gap between simulation and real-world audio-visual scenes. Extensive experiments demonstrate the effectiveness and robustness of our method. We achieve state-of-the-art performance on the RWAVS-Gen, RWAVS, and RAF datasets. Additionally, we show that our approach can be seamlessly integrated with existing methods to significantly improve their performance.

Wed 22 Oct. 14:15 - 16:15 PDT

#371
Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions

Yuwen Pan · Rui Sun · Wangkai Li · Tianzhu Zhang

Semantic segmentation under adverse conditions is crucial for ensuring robust and accurate visual perception in challenging weather conditions. The distinct characteristics of extreme scenarios hinder traditional segmentation paradigms, highlighting the necessity for approaches tailored to adverse weather. Due to the scarcity of labeled data in such scenarios, the unsupervised domain adaptation paradigm is commonly utilized to leverage knowledge from normal weather conditions. Although existing methods strive to absorb information from labeled normal weather data and unlabeled adverse condition images, they face significant challenges due to weather unawareness and severe feature heterogeneity, thus struggling to effectively parse scenes under adverse conditions. In this paper, we propose a novel weather-aware aggregation and adaptation network that leverages characteristic knowledge to achieve weather homogenization and enhance scene perception. Specifically, we introduce amplitude prompt aggregation to capture essential characteristics from the Fourier frequency domain that are indicative of different weather conditions. Additionally, we employ weather heterogeneity adaptation to mitigate the inter-domain heterogeneity, thereby achieving feature homogenization across diverse environments. Extensive experimental results on multiple challenging benchmarks demonstrate that our method achieves consistent improvements for semantic segmentation under adverse conditions.
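For intuition, the Fourier amplitude spectrum mentioned above can be computed as follows. This is only a minimal sketch of the kind of frequency-domain cue that amplitude-based weather prompts operate on; the radial pooling descriptor is an assumed, illustrative scheme, not the paper's amplitude prompt aggregation module.

```python
# Minimal sketch of a Fourier-domain weather cue (illustrative only; the
# radial pooling below is an assumed scheme, not the paper's amplitude
# prompt aggregation module).
import numpy as np

def amplitude_spectrum(image):
    """image: (H, W) or (H, W, C) float array -> centered amplitude spectrum."""
    freq = np.fft.fft2(image, axes=(0, 1))
    return np.abs(np.fft.fftshift(freq, axes=(0, 1)))

def amplitude_descriptor(image, bins=8):
    """Pool the amplitude spectrum into a small radial histogram that can act
    as a compact, weather-indicative descriptor (hypothetical)."""
    amp = amplitude_spectrum(image)
    if amp.ndim == 3:                       # average over color channels
        amp = amp.mean(axis=-1)
    h, w = amp.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0, r.max() + 1e-6, bins + 1)
    return np.array([amp[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

print(amplitude_descriptor(np.random.rand(64, 64, 3)).shape)  # (8,)
```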

Wed 22 Oct. 14:15 - 16:15 PDT

#372
SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Lanmiao Liu · Esam Ghaleb · asli ozyurek · Zerrin Yumak

Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research has mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning a motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module automatically generates gestures from speech, text-based semantics, and speaker identity, using semantic coherence and relevance modules to ensure consistency between the semantic relevance of the generated gestures and the co-occurring speech semantics. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model can be viewed at https://semgesture.github.io. Our code, dataset and pre-trained models will be shared upon acceptance.

Wed 22 Oct. 14:15 - 16:15 PDT

#373
Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions

Thomas Dagès · Michael Lindenbaum · Alfred Bruckstein

Standard convolutions are prevalent in image processing and deep learning, but their fixed kernels limit adaptability. Several deformation strategies of the reference kernel grid have been proposed, yet they lack a unified theoretical framework. By returning to a metric perspective for images, now seen as two-dimensional manifolds equipped with notions of local and geodesic distances, either symmetric (Riemannian) or not (Finsler), we provide a unifying principle: the kernel positions are samples of unit balls of implicit metrics. With this new perspective, we also propose metric convolutions, a novel approach that samples unit balls from explicit signal-dependent metrics, providing interpretable operators with geometric regularisation. This framework, compatible with gradient-based optimisation, can directly replace existing convolutions applied to either input images or deep features of neural networks. Metric convolutions typically require fewer parameters and provide better generalisation. Our approach shows competitive performance in standard denoising and classification tasks.
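The unit-ball view of kernel sampling can be made concrete with a small sketch: assuming a 2x2 symmetric positive-definite (Riemannian) metric tensor per location, the kernel offsets are points on the ellipse {p : pᵀMp = 1}. The function below is illustrative only and does not reproduce the paper's Finsler case or its learned, signal-dependent metrics.

```python
# Illustrative sketch: kernel offsets as samples of the unit ball of a local
# 2x2 SPD (Riemannian) metric, i.e. points p with p^T M p = 1. The Finsler
# case and the learned, signal-dependent metrics of the paper are not shown.
import numpy as np

def unit_ball_offsets(M, num_samples=9):
    """M: (2, 2) SPD metric tensor -> (num_samples, 2) offsets on its unit ball."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_samples, endpoint=False)
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # Euclidean unit circle
    L = np.linalg.cholesky(M)              # M = L L^T
    # If p = L^{-T} u with |u| = 1, then p^T M p = u^T u = 1.
    return circle @ np.linalg.inv(L)

# An anisotropic metric whose unit ball is an ellipse stretched along x.
M = np.array([[0.25, 0.0],
              [0.0,  4.0]])
offsets = unit_ball_offsets(M)
print(np.einsum('ni,ij,nj->n', offsets, M, offsets))  # all ~1.0
```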

Wed 22 Oct. 14:15 - 16:15 PDT

#374
RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding

Baoli Sun · Ning Wang · Xinzhu Ma · Anqi Zou · Lu Yihang · Chuixuan Fan · Zhihui Wang · Kun Lu · Zhiyong Wang

Understanding the behaviors of robotic arms is essential for various robotic applications such as logistics management, precision agriculture, and automated manufacturing. However, the lack of large-scale and diverse datasets significantly hinders progress in video-based robotic arm action understanding. To fill this gap, we present RobAVA, a large-scale dataset containing ~40k video sequences with video-level fine-grained annotations, covering basic actions such as picking, pushing, and placing, as well as their combinations in different orders and interactions with various objects. In contrast to existing action recognition benchmarks, RobAVA includes instances of both normal and anomalous executions for each action category. Our further analysis reveals that the primary challenge in robotic arm action recognition lies in the fact that a complete action consists of a sequence of fundamental, atomic behaviors, requiring models to learn the inter-relationships among them. To this end, we propose a novel baseline approach, AGPT-Net, which re-defines the problem of understanding robotic arm actions as a task of aligning video sequences with atomic attributes. To enhance AGPT-Net's ability to distinguish normal and anomalous action instances, we introduce a joint semantic space constraint between category and attribute semantics, thereby amplifying the separation between normal and anomalous attribute representations for each action. We conduct extensive experiments to demonstrate AGPT-Net's superiority over other mainstream recognition models.

Wed 22 Oct. 14:15 - 16:15 PDT

#375
IDFace: Face Template Protection for Efficient and Secure Identification

Sunpill Kim · Seunghun Paik · Chanwoo Hwang · Dongsoo Kim · Junbum Shin · Jae Hong Seo

As face recognition systems (FRS) become more widely used, user privacy becomes more important. A key privacy issue in FRS is to protect the user’s face template, since the characteristics of the user’s face image can be recovered from the template. Although recent advances in cryptographic tools such as homomorphic encryption (HE) have provided opportunities for securing the FRS, HE cannot be used directly with FRS in an efficient plug-and-play manner. In particular, although HE is functionally complete for arbitrary programs, it is basically designed for algebraic operations on encrypted data of a predetermined shape, such as a polynomial ring. Thus, a non-tailored combination of HE and the system can yield very inefficient performance, and many previous HE-based face template protection methods are hundreds of times slower than plain systems without protection. In this study, we propose IDFace, a new HE-based secure and efficient face identification method with template protection. IDFace is designed on the basis of two novel techniques for efficient searching on a (homomorphically encrypted) biometric database with an angular metric. The first technique is a template representation transformation that sharply reduces the unit cost for the matching test. The second is a space-efficient encoding that reduces wasted space from the encryption algorithm, thus saving the number of operations on encrypted templates. Through experiments, we show that IDFace can identify a face template from among a database of 1M encrypted templates in less than a second, which is at most 97.6X faster than the previous best result using HE.
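For context, angular-metric identification over a plaintext template database reduces to one matrix-vector product over L2-normalized templates, as sketched below. The paper's actual contribution, performing this search efficiently over homomorphically encrypted templates, is deliberately not shown, and all names and the threshold are illustrative assumptions.

```python
# Plaintext sketch of angular-metric identification: with L2-normalized
# templates, matching a probe against a database is a single matrix-vector
# product plus an argmax. The paper's contribution -- doing this efficiently
# on *encrypted* templates -- is deliberately omitted; names and the
# threshold are illustrative assumptions.
import numpy as np

def identify(probe, database, threshold=0.4):
    """probe: (D,); database: (N, D) with unit-normalized rows."""
    probe = probe / (np.linalg.norm(probe) + 1e-9)
    scores = database @ probe                       # cosine similarities in one pass
    best = int(np.argmax(scores))
    return (best if scores[best] >= threshold else None), float(scores[best])

D, N = 512, 100_000
rng = np.random.default_rng(0)
db = rng.standard_normal((N, D))
db /= np.linalg.norm(db, axis=1, keepdims=True)
print(identify(db[42] + 0.02 * rng.standard_normal(D), db))  # likely (42, ~0.9)
```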

Wed 22 Oct. 14:15 - 16:15 PDT

#376
Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

Ziwei Wang · Sameera Ramasinghe · Chenchen Xu · Julien Monteil · Loris Bazzani · Thalaiyasingam Ajanthan

Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
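As background for the hyperbolic embedding space used above, the sketch below computes geodesic distances in the Poincaré ball, the standard model in which distances grow rapidly toward the boundary and hierarchies embed naturally; the paper's contrastive loss with pairwise entailment metrics is not reproduced here.

```python
# Background sketch: geodesic distance in the Poincaré ball, the standard
# hyperbolic model in which distances grow rapidly toward the boundary, making
# it well suited to embedding hierarchies. The paper's pairwise entailment
# contrastive loss is not reproduced here.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq = np.sum((u - v) ** 2)
    nu, nv = np.sum(u ** 2), np.sum(v ** 2)
    return np.arccosh(1.0 + 2.0 * sq / ((1.0 - nu) * (1.0 - nv) + eps))

# Points near the origin behave like coarse "roots", points near the boundary
# like fine-grained "leaves".
root, child, leaf = np.array([0.05, 0.0]), np.array([0.80, 0.0]), np.array([0.95, 0.0])
print(poincare_distance(root, child), poincare_distance(child, leaf))
```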

Wed 22 Oct. 14:15 - 16:15 PDT

#377
DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

Junzhe Lu · Jing Lin · Hongkun Dou · Ailing Zeng · Yue Deng · Xian Liu · Zhongang Cai · Lei Yang · YULUN ZHANG · Haoqian Wang · Ziwei Liu

We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.

Wed 22 Oct. 14:15 - 16:15 PDT

#378
Dual-level Prototype Learning for Composite Degraded Image Restoration

Zhongze Wang · Haitao Zhao · Lujian Yao · Jingchao Peng · Kaijie Zhao

Images captured under severe weather conditions often suffer from complex, composite degradations, varying in intensity. In this paper, we introduce a novel method, Dual-Level Prototype Learning (DPL), to tackle the challenging task of composite degraded image restoration. Unlike previous methods that rely on fixed embeddings to characterize degradation types, DPL maintains a number of degradation-level prototypes to dynamically represent specific degradation scenes. Furthermore, considering the diverse factors influencing each degradation type, factor-level prototypes are incorporated to capture variations in individual degradation factors. Image features are matched with both degradation-level and factor-level prototypes, producing detailed scene embeddings that enhance the network's understanding of composite degradations. These scene embeddings are then processed through Dual Scene Embedding Transformer Blocks to guide the restoration process. To further refine the prototype distribution, we propose a Prototype Scatter Learning Loss, which encourages prototypes within the same degradation type to capture richer information while pushing prototypes of different degradation types apart. Additionally, we introduce a new Variable Composite Degradation (VCD) dataset, which contains images with different intensities of each type of composite degradation, to validate the efficacy of our method. Extensive experiments demonstrate that DPL significantly outperforms existing methods in restoring images with composite degradations.
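A minimal sketch of the prototype matching described above is given below: image features are compared against degradation-level and factor-level prototypes with cosine similarity and softly aggregated into a scene embedding. Shapes, the temperature, and the softmax readout are assumptions, not the authors' exact modules.

```python
# Hypothetical sketch of prototype matching: image features are compared to
# degradation-level and factor-level prototypes with cosine similarity and
# softly aggregated into a scene embedding. Shapes, the temperature, and the
# softmax readout are assumptions, not the authors' DPL modules.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scene_embedding(feat, deg_protos, factor_protos, tau=0.1):
    """feat: (D,); deg_protos: (Nd, D); factor_protos: (Nf, D) -> (2D,)."""
    f = feat / (np.linalg.norm(feat) + 1e-9)
    def soft_readout(protos):
        p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-9)
        w = softmax(p @ f / tau)          # similarity-weighted attention over prototypes
        return w @ protos                  # soft prototype readout
    return np.concatenate([soft_readout(deg_protos), soft_readout(factor_protos)])

D = 64
emb = scene_embedding(np.random.randn(D), np.random.randn(4, D), np.random.randn(12, D))
print(emb.shape)  # (128,)
```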

Wed 22 Oct. 14:15 - 16:15 PDT

#379
BVINet: Unlocking Blind Video Inpainting with Zero Annotations

zhiliang wu · Kerui Chen · Kun Li · Hehe Fan · Yi Yang

Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the “how to inpaint”. This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate “where to inpaint”. However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we aim to relax this assumption by defining a new blind video inpainting setting, enabling the networks to learn the mapping from corrupted video to inpainted result directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both “where to inpaint” and “how to inpaint” simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantic-discontinuous regions of the frame and utilizing the temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Recognizing that existing datasets are unsuitable for the blind video inpainting task due to the presence of prior knowledge (e.g., corrupted contents and clear borders), we contribute a new dataset specifically designed for blind video inpainting. Extensive experimental results demonstrate the effectiveness and superiority of our method.

Wed 22 Oct. 14:15 - 16:15 PDT

#380
HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

Chang Liu · Yunfan Ye · Fan Zhang · Qingyang Zhou · Yuchuan Luo · Zhiping Cai

Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics, and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches: video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representations by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.

Wed 22 Oct. 14:15 - 16:15 PDT

#381
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Yatian Pang · Bin Zhu · Bin Lin · Mingzhe Zheng · Francis Tay · Ser-Nam Lim · Harry Yang · Li Yuan

In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle with generating coherent, high-quality content in an efficient and user-friendly manner. Concretely, baseline methods relying on only 2D pose guidance lack the cues of 3D information like depth and normal maps, leading to suboptimal results. Other works introduce extra representations to provide additional 3D information but inevitably involve a cumbersome and time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses by introducing an efficient diffusion model, enabling high-quality human image animation with various guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations could enrich the guidance signals, facilitating intra-frame coherency and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human images effectively with a video diffusion model. Extensive experiments demonstrate that our method achieves state-of-the-art performance in animating human images compared to baseline methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#382
I2VControl: Disentangled and Unified Video Motion Synthesis Control

Wanquan Feng · Tianhao Qi · Jiawei Liu · Mingzhen Sun · Pengqi Tu · Tianxiang Ma · Fei Dai · Songtao Zhao · SiYu Zhou · Qian HE

Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Please see the video results in our anonymous github repository: https://github.com/iccv2025sub592/sub592.

Wed 22 Oct. 14:15 - 16:15 PDT

#383
Highlight
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

Shuangkang Fang · I-Chao Shen · Yufeng Wang · Yi-Hsuan Tsai · Yi Yang · Shuchang Zhou · Wenrui Ding · Takeo Igarashi · Ming-Hsuan Yang

We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs' token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50x larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs' ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.

Wed 22 Oct. 14:15 - 16:15 PDT

#384
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation

Yiming Wu · Huan Wang · Zhenghao Chen · Jianxin Pang · Dong Xu

Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and extensive memory footprint. In this paper, we propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis on existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline, optimizing the model's post-pruning recoverability explicitly. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on three standard datasets, i.e., Push-T, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments.

Wed 22 Oct. 14:15 - 16:15 PDT

#385
Generic Event Boundary Detection via Denoising Diffusion

Jaejun Hwang · Dayoung Gong · Manjin Kim · Minsu Cho

Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, TAPOS and Kinetics-GEBD, generating diverse and plausible event boundaries.
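The temporal self-similarity encoding mentioned above can be illustrated in a few lines: per-frame features (stubbed with random vectors here) are compared pairwise with cosine similarity, and event boundaries tend to show up as block boundaries in the resulting matrix. This is a generic sketch, not the paper's encoder.

```python
# Generic sketch of a temporal self-similarity matrix over per-frame features
# (features stubbed with random vectors); event boundaries tend to appear as
# block boundaries in this matrix. Not the paper's encoder.
import numpy as np

def temporal_self_similarity(frame_feats):
    """frame_feats: (T, D) -> (T, T) cosine self-similarity matrix."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-9)
    return f @ f.T

T, D = 32, 128
S = temporal_self_similarity(np.random.randn(T, D))
print(S.shape, round(float(S[0, 0]), 3))  # (32, 32) 1.0
```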

Wed 22 Oct. 14:15 - 16:15 PDT

#386
Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement

Junyu Lou · Xiaorui Zhao · Kexuan Shi · Shuhang Gu

Deep learning-based bilateral grid processing has emerged as a promising solution for image enhancement, inherently encoding spatial and intensity information while enabling efficient full-resolution processing through slicing operations. However, existing approaches are limited to linear affine transformations, hindering their ability to model complex color relationships. Meanwhile, while multi-layer perceptrons (MLPs) excel at non-linear mappings, traditional MLP-based methods employ globally shared parameters, which makes it hard to handle localized variations. To overcome these dual challenges, we propose a Bilateral Grid-based Pixel-Adaptive Multi-layer Perceptron (BPAM) framework. Our approach synergizes the spatial modeling of bilateral grids with the non-linear capabilities of MLPs. Specifically, we generate bilateral grids containing MLP parameters, where each pixel dynamically retrieves its unique transformation parameters and obtains a distinct MLP for color mapping based on spatial coordinates and intensity values. In addition, we propose a novel grid decomposition strategy that categorizes MLP parameters into distinct types stored in separate subgrids. Multi-channel guidance maps are used to extract category-specific parameters from corresponding subgrids, ensuring effective utilization of color information during slicing while guiding precise parameter generation. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods in performance while maintaining real-time processing capabilities.
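To make the slicing idea concrete, the sketch below stores a tiny per-cell color transform in a bilateral grid indexed by spatial position and a guidance intensity, and each pixel fetches its own parameters (nearest-cell lookup is used instead of trilinear slicing for brevity). Grid sizes, the single-layer "MLP", and the tanh nonlinearity are illustrative assumptions, not the authors' BPAM.

```python
# Illustrative sketch of per-pixel parameter retrieval from a bilateral grid
# (nearest-cell lookup instead of trilinear slicing, and a single-layer
# "MLP"); grid sizes and the tanh nonlinearity are assumptions, not BPAM.
import numpy as np

H, W = 64, 64
GY, GX, GZ = 8, 8, 4                      # spatial x intensity grid resolution
rng = np.random.default_rng(0)
grid_w = rng.normal(0.0, 0.1, size=(GY, GX, GZ, 3, 3))   # per-cell weight matrices
grid_b = np.zeros((GY, GX, GZ, 3))                        # per-cell biases

def enhance(image, guide):
    """image: (H, W, 3) in [0, 1]; guide: (H, W) intensity guidance map."""
    ys = np.clip(np.arange(H) * GY // H, 0, GY - 1)
    xs = np.clip(np.arange(W) * GX // W, 0, GX - 1)
    zs = np.clip((guide * GZ).astype(int), 0, GZ - 1)
    out = np.empty_like(image)
    for i in range(H):
        for j in range(W):
            w = grid_w[ys[i], xs[j], zs[i, j]]       # pixel-specific parameters
            b = grid_b[ys[i], xs[j], zs[i, j]]
            out[i, j] = np.tanh(w @ image[i, j] + b) # tiny non-linear per-pixel map
    return out

img = rng.random((H, W, 3))
print(enhance(img, img.mean(axis=-1)).shape)  # (64, 64, 3)
```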

Wed 22 Oct. 14:15 - 16:15 PDT

#387
Multi-Modal Few-Shot Temporal Action Segmentation

Zijia Lu · Ehsan Elhamifar

Procedural videos are critical for learning new tasks. Temporal action segmentation (TAS), which classifies the action in every video frame, has become essential for understanding procedural videos. Existing TAS models, however, are limited to a fixed-set of tasks learned at training and unable to adapt to novel tasks at test time. Thus, we introduce the new problem of Multi-Modal Few-shot Temporal Action Segmentation (MMF-TAS) to learn models that can generalize to novel procedural tasks with minimal visual/textual examples. We propose the first MMF-TAS framework, by designing a Prototype Graph Network (PGNet). PGNet contains a Prototype Building Block that summarizes action information from support videos of the novel tasks via an Action Relation Graph, and encodes this information into action prototypes via a Dynamic Graph Transformer. Next, it employs a Matching Block that compares action prototypes with query videos to infer framewise action labels. To exploit the advantages of both visual and textual modalities, we compute separate action prototypes for each modality and combine the two modalities by a prediction fusion method to avoid overfitting on one modality. By extensive experiments on procedural datasets, we show that our method successfully adapts to novel tasks during inference and significantly outperforms baselines.

Wed 22 Oct. 14:15 - 16:15 PDT

#388
SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation

Wenjia Wang · Liang Pan · Zhiyang Dou · Jidong Mei · Zhouyingcheng Liao · Yifan Wu · Yuke Lou · Jingbo Wang · Lei Yang · Taku Komura

Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short in achieving both diverse style and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues, simultaneously perceiving environmental geometries and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotions and interactions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.

Wed 22 Oct. 14:15 - 16:15 PDT

#389
ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Sanjoy Kundu · Shanmukha Vellamcheti · Sathyanarayanan Aakur

Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while efficiently avoiding exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0–L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition. Code (in supplementary) will be shared publicly after review.

Wed 22 Oct. 14:15 - 16:15 PDT

#390
Unified Adversarial Augmentation for Improving Palmprint Recognition

Jianlong Jin · Chenglong Zhao · Ruixin Zhang · Sheng Shang · Yang Zhao · Jun Wang · Jingyun Zhang · Shouhong Ding · Wei Jia · Yunsheng Wu

Current palmprint recognition models achieve strong performance on constrained datasets, yet exhibit significant limitations in handling challenging palmprint samples with geometric distortions and textural degradations. Data augmentation is widely adopted to improve model generalization. However, existing augmentation methods struggle to generate palmprint-specific variations while preserving identity consistency, leading to suboptimal performance. To address these problems, we propose a unified adversarial augmentation framework. It first utilizes an adversarial training paradigm for palmprint recognition, optimizing for challenging augmented samples by incorporating feedback from the recognition network. We enhance palmprint images with both geometric and textural variations. Specifically, it adopts a spatial transformation module and a new identity-preserving module, which synthesizes palmprints with diverse textural variations while maintaining consistent identity. For more effective adversarial augmentation, a dynamic sampling strategy is proposed. Extensive experiments demonstrate the superior performance of our method on both challenging and constrained palmprint datasets. Our code will be released.

Wed 22 Oct. 14:15 - 16:15 PDT

#391
Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution

hongjun wang · Jiyuan Chen · Zhengwei Yin · Xuan Song · Yinqiang Zheng

Generalizable Image Super-Resolution aims to enhance model generalization capabilities under unknown degradations. To achieve this goal, models are expected to focus only on image content-related features instead of degradation details (i.e., overfitting degradations). Recently, numerous approaches such as dropout and feature alignment have been proposed to suppress models' natural tendency to overfit degradations, yielding promising results. Nevertheless, these works have assumed that models overfit to all degradation types (e.g., blur, noise), while through careful investigations in this paper, we discover that models predominantly overfit to noise, largely attributable to the distinct degradation pattern of noise compared to other degradation types. In this paper, we propose a targeted feature denoising framework, comprising noise detection and denoising modules. Our approach represents a general solution that can be seamlessly integrated with existing super-resolution models without requiring architectural modifications. Our framework demonstrates superior performance compared to previous regularization-based methods across five traditional benchmark datasets, encompassing both synthetic and real-world scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#392
SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians

Liam Schoneveld · Zhe Chen · Davide Davoli · Jiapeng Tang · Saimon Terazawa · Ko Nishino · Matthias Nießner

Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming the state of the art in emotion classification.

Wed 22 Oct. 14:15 - 16:15 PDT

#393
MultiModal Action Conditioned Video Simulation

Yichen Li · Antonio Torralba

General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce the senses of proprioception, kinesthesia, force haptics, and muscle activation to capture such precise control. This comprehensive set of multimodal senses naturally enables fine-grained interactions that are difficult to simulate with unimodal or text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further regularize action trajectory features to enhance causality for representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

Wed 22 Oct. 14:15 - 16:15 PDT

#394
LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds

Lingteng Qiu · Xiaodong Gu · Peihao Li · Qi Zuo · Weichao Shen · Junfei Zhang · Kejie Qiu · Weihao Yuan · Guanying Chen · Zilong Dong · Liefeng Bo

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.

Wed 22 Oct. 14:15 - 16:15 PDT

#395
Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

Chengxu Liu · Lu Qi · Jinshan Pan · Xueming Qian · Ming-Hsuan Yang

Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, most existing approaches only use adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blurry patterns. In this paper, we propose a novel diffusion model (DM)-based framework, dubbed TP-Diff, for image deblurring by learning a spatially varying texture prior from unpaired sharp data. In particular, TP-Diff employs a DM to generate the prior knowledge used to recover the texture of blurry images. To implement it, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to encode the texture prior and thereby provide supervision for the DM training. To fully exploit the generated texture priors, we further present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. In addition, a wavelet-based adversarial loss is used to preserve high-frequency texture details. Extensive evaluations demonstrate that TP-Diff provides a promising unsupervised deblurring solution and outperforms SOTA methods on six widely-used benchmarks.

Wed 22 Oct. 14:15 - 16:15 PDT

#396
GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Dongbin Zhang · Yunfei Liu · Lijian Lin · Ye Zhu · Yang Li · Minghan Qin · Yu Li · Haoqian Wang

Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX’s expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (~0.1s), and supports real-time animation and rendering.

Wed 22 Oct. 14:15 - 16:15 PDT

#397
Recognizing Actions from Robotic View for Natural Human-Robot Interaction

Ziyi Wang · Peiming Li · Hong Liu · Zhichao Deng · Can Wang · Jun Liu · Junsong Yuan · Mengyuan Liu

Natural Human-Robot Interaction (N-HRI) requires a robot to recognize human actions at varying distances while accounting for disturbing motions from either the human or the robot. However, existing human action datasets are primarily designed for conventional Human-Robot Interaction (HRI) and fail to meet the unique requirements of N-HRI due to limited data, data modalities, task categories, and diversity in subjects and environments. To address this, we introduce ACTIVE, a large-scale human action dataset focused on ACtions from RoboTIc ViEw. Our dataset includes 30 action categories, 80 participants and 46,868 video instances, encompassing both point cloud and RGB modalities. During data capture, participants perform a range of human actions in diverse environments at varying distances (from 3m to 50m), while also executing disturbing motions, and with the robot itself in different states of motion. To recognize actions from a robotic view, we propose ACTIVE-PC, a Point Cloud-based method for ACTIVE dataset, which is able to recognize human actions at long distances using our proposed Multilevel Neighborhood Sampling, Layered Recognizers, and Elastic Ellipse Query, along with precise decoupling of kinematic interference and human actions. Experimental results verify the effectiveness of our method. Our project page is https://active2750.github.io/.

Wed 22 Oct. 14:15 - 16:15 PDT

#398
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Video-to-Audio Synthesis

This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp, event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a relative 53% reduction in Fréchet Distance (FD), a 29% reduction in Fréchet Audio Distance (FAD), and 97.19% alignment accuracy, highlighting its superior audio quality and synchronization precision.
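A minimal sketch of a timestep-adaptive alignment weight is shown below: a representation-alignment loss is scaled by a function of the diffusion/flow timestep so that cleaner timesteps are aligned more strongly. The exponential weighting, the cosine loss, and the timestep convention are assumptions made for illustration, not the paper's TRA formulation.

```python
# Hypothetical sketch of a timestep-adaptive alignment weight: a
# representation-alignment loss is scaled by a function of the diffusion/flow
# timestep so cleaner timesteps are aligned more strongly. The exponential
# schedule, cosine loss, and timestep convention are illustrative choices,
# not the paper's TRA formulation.
import numpy as np

def alignment_weight(t, sharpness=5.0):
    """t in [0, 1], with 0 = clean signal and 1 = pure noise (assumed)."""
    return float(np.exp(-sharpness * t))

def tra_loss(latent_feats, target_feats, t):
    """Timestep-weighted cosine alignment between latents and reference features."""
    a = latent_feats / (np.linalg.norm(latent_feats, axis=-1, keepdims=True) + 1e-9)
    b = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + 1e-9)
    return alignment_weight(t) * (1.0 - float(np.sum(a * b, axis=-1).mean()))

x, y = np.random.randn(16, 256), np.random.randn(16, 256)
print(tra_loss(x, y, t=0.1), tra_loss(x, y, t=0.9))  # early timestep weighted more
```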

Wed 22 Oct. 14:15 - 16:15 PDT

#399
UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

Dayong Su · Yafei Zhang · Huafeng Li · Jinxing Li · Yu Liu

Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN’s adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method’s effectiveness and significant advantages over existing approaches.

Wed 22 Oct. 14:15 - 16:15 PDT

#400
Highlight
DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He · Danshi Li · Xinqiang Yu · Zekun Qi · Wenyao Zhang · Jiayi Chen · Zhaoxiang Zhang · Zhizheng Zhang · Li Yi · He Wang

As large models begin to gain momentum, vision-language foundation models are enabling robots to generalizably perform more and more tasks. However, due to the difficulty of data collection, these benefits remain largely limited to simple embodiments. In this paper, we present DexVLG, a vision-language model that predicts language-instruction-aligned dexterous grasp poses given single-view RGBD perception. To achieve this, we first synthesize a dataset of 170M dexterous grasp poses aligned with semantic parts on 174k objects in simulation, paired with informative part-level captions. With this large-scale dataset, named DexGraspNet 3.0, we train a flow-matching VLM to generate instruction-aligned grasp poses on tabletop objects. To evaluate DexVLG, we curate benchmarks in physics-based simulation and perform real-world experiments. Our extensive experiments demonstrate DexVLG's great zero-shot generalizability, achieving over a 76% zero-shot execution success rate and state-of-the-art part grasp accuracy in simulation, and demonstrate successful part-aligned grasps on real-world objects.

Wed 22 Oct. 14:15 - 16:15 PDT

#401
Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

Yifan Zhan · Qingtian Zhu · Muyao Niu · Mingze Ma · Jiancheng Zhao · Zhihang Zhong · Xiao Sun · Yu Qiao · Yinqiang Zheng

In this paper, we highlight a critical yet often overlooked factor in most 3D human tasks, namely modeling complicated 3D humans with hand-held objects or loose-fitting clothing. It is known that the parameterized formulation of SMPL is able to fit human skin, while hand-held objects and loose-fitting clothing are difficult to model within the unified framework, since their movements are usually decoupled from the human body. To enhance the capability of the SMPL skeleton in response to this situation, we propose a growth strategy that enables the joint tree of the skeleton to expand adaptively. Specifically, our method, called ToMiE, consists of parent joints localization and external joints optimization. For parent joints localization, we employ a gradient-based approach guided by both LBS blending weights and motion kernels. Once the external joints are obtained, we proceed to optimize their transformations in SE(3) across different frames, enabling rendering and explicit animation. ToMiE manages to outperform other methods across various cases with hand-held objects and loose-fitting clothing, not only in rendering quality but also by offering free animation of grown joints, thereby enhancing the expressive ability of the SMPL skeleton for a broader range of applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#402
AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

Aggelina Chatziagapi · Louis-Philippe Morency · Hongyu Gong · Michael Zollhöfer · Dimitris Samaras · Alexander Richard

We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars.

Wed 22 Oct. 14:15 - 16:15 PDT

#403
Democratizing High-Fidelity Co-Speech Gesture Video Generation

Xu Yang · Shaoli Huang · Shenbo Xie · Xuelin Chen · Yifei Liu · Changxing Ding

Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405—the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts. Code, models, and CSG-405 will be publicly released.

Wed 22 Oct. 14:15 - 16:15 PDT

#404
Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration

Yuan Sun · Xuan Wang · Cong Wang · WeiLi Zhang · Yanbo Fan · Yu Guo · Fei Wang

Recently, 3D head avatar modeling based on 3D Gaussians has demonstrated significant advantages in rendering quality and efficiency, provided there is sufficient data. Some efforts have begun to train prior models on large datasets to develop generalizable 3D Gaussian head avatar modeling methods. Unfortunately, due to the limited expressive power of identity-shared 3D representations, prior-based modeling often results in degraded rendering quality. To overcome this limitation, we propose to formulate 3D Gaussian head avatar modeling as a joint reconstruction and registration problem. Given static input images (e.g., a short mobile phone capture), we optimize two sets of 3D Gaussians: the prior-based one possesses complete animation rigging information inferred from the prior model and produces plausible modeling results, while the prior-free one is used to more freely capture the fine-grained geometric and texture details in the input images. Additionally, we simultaneously solve the registration problem between the two 3D Gaussian sets. On one hand, the registration results provide binding information for the prior-free reconstruction to make it animatable. On the other hand, during optimization, the prior-based Gaussians can regularize the prior-free reconstruction to resist overfitting and perform well on novel expressions. Finally, we merge the parts of the prior-based reconstruction that are occluded in the input images with the prior-free reconstruction, and then apply appropriate post-processing strategies (such as teeth enhancement) to produce a complete head avatar. We evaluated our method on the public Nersemble dataset and our own in-the-wild data. The experiments demonstrate that, under the same experimental settings, our method significantly improves modeling quality and provides better support for detailed modeling at higher resolutions.

Wed 22 Oct. 14:15 - 16:15 PDT

#405
Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations

Ruoxi Guo · Huaijin Pi · Zehong Shen · Qing Shuai · zechenhu zechenhu · Zhumei Wang · Yajiao Dong · Ruizhen Hu · Taku Komura · Sida Peng · Xiaowei Zhou

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition and ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on well-acknowledged datasets and novel text prompts demonstrate that our method efficiently utilizes 2D data, supporting a wider range of realistic 3D human motion generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#406
IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution

Sejin Park · Sangmin Lee · Kyong Hwan Jin · Seung-Won Jung

Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that operates ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
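The interpolation-mixing idea can be sketched in a few lines: upscale the input with several fixed interpolators and blend them with per-pixel weights. In the sketch below the weights default to a uniform stub, whereas in the paper they would come from IM-Net or its LUT counterpart; the function names and the choice of interpolators are assumptions.

```python
# Sketch of interpolation mixing: upscale with several fixed interpolators and
# blend them with per-pixel weights. The uniform-weight stub stands in for the
# predicted weights (IM-Net / its LUT form in the paper); function names and
# the set of interpolators are assumptions.
import numpy as np
from scipy.ndimage import zoom

def mix_interpolations(lr, scale, weights=None):
    """lr: (H, W); scale: float; weights: (3, H*scale, W*scale) or None."""
    candidates = np.stack([
        zoom(lr, scale, order=0),   # nearest neighbour
        zoom(lr, scale, order=1),   # bilinear
        zoom(lr, scale, order=3),   # cubic
    ])
    if weights is None:                              # stub: uniform mixing
        weights = np.full(candidates.shape, 1.0 / len(candidates))
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * candidates).sum(axis=0)

sr = mix_interpolations(np.random.rand(16, 16), scale=2.0)
print(sr.shape)  # (32, 32)
```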

Wed 22 Oct. 14:15 - 16:15 PDT

#407
TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

Gong Meiqi · Hao Zhang · Xunpeng Yi · Linfeng Tang · Jiayi Ma

Existing multi-modal fusion methods typically extend image fusion techniques directly to video fusion tasks, which discard inherent temporal information and struggle to maintain temporal consistency between video frames. To address this limitation, we propose a comprehensive method specifically designed for multi-modal video fusion, leveraging a temporally consistent framework with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with Dinov2 and VGG19 employed for distillation. This approach enables the simultaneous and targeted enhancement of both the visual and semantic representations of videos for the first time. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline, by constructing a temporal cooperative module that leverages temporal dependencies to facilitate weak information recovery. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative metrics tailored for video fusion, aimed at evaluating the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets validate the superiority of our method.

Wed 22 Oct. 14:15 - 16:15 PDT

#408
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

Rundong Luo · Matthew Wallingford · Ali Farhadi · Noah Snavely · Wei-Chiu Ma

360$^\circ$ videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360$^\circ$ generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360$^\circ$ videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360$^\circ$ video generation. Experimental results demonstrate that our model can generate realistic and coherent 360$^\circ$ videos from in-the-wild perspective video. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.

Wed 22 Oct. 14:15 - 16:15 PDT

#409
SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Stathis Galanakis · Alexandros Lattas · Stylianos Moschoglou · Bernhard Kainz · Stefanos Zafeiriou

Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge in computer vision. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although recently emerging large-scale diffusion models have proven robust in handling 3D scenes, they underperform on facial data, given its complex structure and the pitfalls of the uncanny valley. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in full head synthesis, while outperforming current state-of-the-art multi-view diffusion models.

Wed 22 Oct. 14:15 - 16:15 PDT

#410
Highlight
DIMO: Diverse 3D Motion Generation for Arbitrary Objects

Linzhan Mou · Jiahui Lei · Chen Wang · Lingjie Liu · Kostas Daniilidis

We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications, including 3D motion interpolation and language-guided motion generation.

Wed 22 Oct. 14:15 - 16:15 PDT

#411
OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Saihui Hou · Panjian Huang · Zengbin Wang · Yuan Liu · Zeyu Li · Man Zhang · Yongzhen Huang

This paper addresses the challenge of animal re-identification, an emerging field that shares similarities with person re-identification but presents unique complexities due to the diversity of species, environments, and poses. To facilitate research in this domain, we introduce OpenAnimals, a flexible and extensible codebase designed specifically for animal re-identification. We conduct a comprehensive study by revisiting several state-of-the-art person re-identification methods, including BoT, AGW, SBS, and MGN, and evaluate their effectiveness on animal re-identification benchmarks such as HyenaID, LeopardID, SeaTurtleID, and WhaleSharkID. Our findings reveal that while some techniques generalize well, many do not, underscoring the significant differences between the two tasks. To bridge this gap, we propose ARBase, a strong Base model tailored for Animal Re-identification, which incorporates insights from extensive experiments and introduces simple yet effective animal-oriented designs. Experiments demonstrate that ARBase consistently outperforms existing baselines, achieving state-of-the-art performance across various benchmarks.

Wed 22 Oct. 14:15 - 16:15 PDT

#412
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen · Yifang Men · Yuan Yao · Miaomiao Cui · Liefeng Bo

Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive and consistent visual changes. Then, our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed approach.

Wed 22 Oct. 14:15 - 16:15 PDT

#413
Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking

Yunhao Li · Yifan Jiao · Dan Meng · Heng Fan · Libo Zhang

Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information, which is unique and essential information for tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose \textbf{TRACT}, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a \textit{Trajectory Consistency Reinforcement} (\textbf{TCR}) strategy to maintain continuity across frames while tracking. Furthermore, we propose \textbf{TraCLIP}, a plug-and-play trajectory classification module. It integrates \textit{Trajectory Feature Aggregation} (\textbf{TFA}) and \textit{Trajectory Semantic Enrichment} (\textbf{TSE}) strategies to fully leverage trajectory information from visual and language perspectives, respectively. Experiments on the OV-TAO benchmark demonstrate that our approach significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT.
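
A minimal sketch of trajectory-level open-vocabulary classification in this spirit, assuming simple mean pooling of per-frame visual features and CLIP-style cosine matching against text embeddings; the pooling choice, feature dimension, and temperature are assumptions and do not reproduce the paper's TFA/TSE design.

    # Illustrative sketch (assumed, simplified): aggregate per-frame features along a
    # trajectory and classify the track against open-vocabulary text embeddings.
    import torch
    import torch.nn.functional as F

    def classify_trajectory(frame_feats, text_feats, temperature=0.01):
        # frame_feats: [T, D] visual features of one tracked object over T frames
        # text_feats:  [K, D] embeddings of K category names (e.g., from a CLIP text encoder)
        traj_feat = F.normalize(frame_feats.mean(dim=0, keepdim=True), dim=-1)  # [1, D] pooled track feature
        text_feats = F.normalize(text_feats, dim=-1)
        logits = traj_feat @ text_feats.t() / temperature                       # [1, K] similarities
        return logits.softmax(dim=-1)

    probs = classify_trajectory(torch.randn(30, 512), torch.randn(100, 512))
    print(probs.shape, probs.sum().item())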

Wed 22 Oct. 14:15 - 16:15 PDT

#414
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Junjie He · Yifeng Geng · Liefeng Bo

This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image, achieving the customization of single and multiple IDs. With a carefully designed two-stage training scheme, UniPortrait achieves superior performance in both single- and multi-ID customization. Quantitative and qualitative experiments demonstrate the advantages of our method over existing approaches as well as its good scalability, e.g., the universal compatibility with existing generative control tools.

Wed 22 Oct. 14:15 - 16:15 PDT

#415
SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition

Jing Wang · Rui Zhao · Ruiqin Xiong · Xingtao Wang · Xiaopeng Fan · Tiejun Huang

Open-vocabulary action recognition (OVAR) extends recognition systems to identify unseen action categories. While large-scale vision-language models (VLMs) like CLIP have enabled OVAR in image domains, their adaptation to event data remains underexplored. Event cameras offer high temporal resolution and inherent privacy preservation, making them suitable for capturing fine-grained motion dynamics. However, leveraging event data for OVAR presents challenges: 1) bridging the domain gap between static image-based models and event streams, and 2) preserving the generalization capabilities of pretrained VLMs in open-vocabulary settings. In this paper, we propose SAMPLE, a lightweight adaptation of VLMs for event-based action recognition, balancing supervised and open-vocabulary performance. We introduce a \textit{Temporal-Adaptive Multimodal Prompt Learning} strategy that comprises: 1) unimodal prompts on both the event and text branches to learn the data distribution; 2) an event-text cross-modal prompt for representation space alignment; and 3) a temporal-adaptive prompt to model temporal dependencies across event data. Extensive evaluations demonstrate that SAMPLE outperforms prior methods across fully supervised, few-shot, base-to-novel and zero-shot settings. Notably, in zero-shot scenarios, SAMPLE achieves gains of +15.46%, +29.76%, and +23.79% on SeAct, DVS128Gesture, and PAF respectively, with lower computational cost. Our codes are included in the supplementary materials. The codes and models will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#416
Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling

LI XIAOJIE · Ronghui Li · Shukai Fang · Shuzhao Xie · Xiaoyang Guo · Jiaqing Zhou · Junkun Peng · Zhi Wang

Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences. Additional resources are available on our project: https://anonymous.4open.science/w/SoulDance-BBD3/

Wed 22 Oct. 14:15 - 16:15 PDT

#417
Highlight
Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

Guanyi Qin · Ziyue Wang · Daiyun Shen · Haofeng Liu · Hantao Zhou · Junde Wu · Runze Hu · Yueming Jin

Given an object mask, the Semi-supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory-based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real-time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object-level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state-of-the-art methods, achieving an F value of 91.6 (vs. 89.7) on the DAVIS-17 validation set and a G value of 86.6 (vs. 86.2) on the YouTubeVOS 2019 validation set, while maintaining a competitive speed of 48 FPS on DAVIS. Checkpoints, logs, and codes will be available upon publication.
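
A minimal sketch of injecting a rough Canny edge prior into backbone features, assuming a simple multiplicative re-weighting; the fusion operator, thresholds, and feature shapes are assumptions, not the paper's structure refinement module.

    # Illustrative sketch (assumptions throughout): emphasise boundary regions of a
    # feature map using a rough Canny edge prior before mask prediction.
    import cv2
    import numpy as np
    import torch
    import torch.nn.functional as F

    def edge_enhanced_features(image_bgr, feats):
        # image_bgr: HxWx3 uint8 frame; feats: [1, C, H', W'] backbone features
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200).astype(np.float32) / 255.0   # rough edge prior in {0, 1}
        edge_t = torch.from_numpy(edges)[None, None]                   # [1, 1, H, W]
        edge_t = F.interpolate(edge_t, size=feats.shape[-2:], mode="bilinear", align_corners=False)
        return feats * (1.0 + edge_t)                                  # simple boundary re-weighting

    frame = (np.random.rand(480, 854, 3) * 255).astype(np.uint8)
    feats = torch.randn(1, 256, 120, 214)
    print(edge_enhanced_features(frame, feats).shape)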

Wed 22 Oct. 14:15 - 16:15 PDT

#418
NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

Haotian Dong · Xin WANG · Di Lin · Yipeng Wu · Qin Chen · Ruonan Liu · Kairui Yang · Ping Li · Qing Guo

High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistency remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during video generation. In this paper, we propose NoiseController, consisting of Multi-Level Noise Decomposition, Multi-Frame Noise Collaboration, and Joint Denoising, to enhance spatiotemporal consistency in video generation. In multi-level noise decomposition, we first decompose initial noises into scene-level foreground/background noises, capturing distinct motion properties to model multi-view foreground/background variations. Furthermore, each scene-level noise is further decomposed into individual-level shared and residual components. The shared noise preserves consistency, while the residual component maintains diversity. In multi-frame noise collaboration, we introduce an inter-view spatiotemporal collaboration matrix and an intra-view impact collaboration matrix, which capture mutual cross-view effects and historical cross-frame impacts to enhance video quality. The joint denoising contains two parallel denoising U-Nets to remove each scene-level noise, mutually enhancing video generation. We evaluate our NoiseController on public datasets focusing on video generation and downstream tasks, demonstrating its state-of-the-art performance.
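
A minimal sketch of the shared-plus-residual noise idea, assuming a variance-preserving linear mix with a single coefficient alpha; the paper's multi-level decomposition and collaboration matrices are not modeled here.

    # Illustrative sketch (assumed formulation): build per-view initial noise from a
    # shared component (consistency) plus an independent residual (diversity).
    import torch

    def composed_noise(num_views, shape, alpha=0.7):
        shared = torch.randn(*shape)                        # identical for every view
        residual = torch.randn(num_views, *shape)           # view-specific
        # variance-preserving mix so each view's noise stays approximately N(0, I)
        return alpha * shared.unsqueeze(0) + (1 - alpha ** 2) ** 0.5 * residual

    noise = composed_noise(num_views=6, shape=(4, 64, 64), alpha=0.7)
    print(noise.shape, noise.std().item())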

Wed 22 Oct. 14:15 - 16:15 PDT

#419
FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling

Jingting Li · Yu Qian · Lin Zhao · Su-Jing Wang

Micro-expressions (MEs) are brief, low-intensity, often localized facial expressions. They can reveal genuine emotions that individuals may attempt to conceal, making them valuable in contexts such as criminal interrogation and psychological counseling. However, ME recognition (MER) faces challenges, such as small sample sizes and subtle features, which hinder efficient modeling. Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose the FED-PsyAU research framework. We begin with a psychological study on the coordination of upper and lower facial action units (AUs) to provide structured prior knowledge of facial muscle dynamics. We then develop a DPK-GAT network that combines these psychological priors with statistical AU patterns, enabling hierarchical learning of facial motion features from regional to global levels, effectively enhancing MER performance. Additionally, our federated learning framework advances MER capabilities across multiple clients without data sharing, preserving privacy and alleviating the limited-sample issue for each client. Extensive experiments on commonly-used ME databases demonstrate the effectiveness of our approach.

Wed 22 Oct. 14:15 - 16:15 PDT

#420
MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang · YaoYang Liu · Bin Xia · Bohao PENG · Zexin Yan · Eric Lo · Jiaya Jia

We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that MagicMirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#421
MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

Yikun Ma · Yiqing Li · Jiawei Wu · Xing Luo · Zhi Jin

Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and to ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.
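
A minimal sketch of flow-assisted warping, assuming standard backward warping of an image with a dense pixel-space flow field via grid sampling; this is generic machinery rather than the paper's multi-view pipeline, and the flow convention (dx, dy per target pixel) is an assumption.

    # Illustrative sketch (standard backward warping, not the paper's method): warp an
    # image with a dense flow field using grid_sample.
    import torch
    import torch.nn.functional as F

    def warp_with_flow(image, flow):
        # image: [B, C, H, W]; flow: [B, 2, H, W] in pixels, (dx, dy) per target pixel
        B, _, H, W = image.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)        # [1, 2, H, W] pixel grid
        coords = base + flow                                            # sampling locations
        coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0                   # normalise to [-1, 1]
        coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
        grid = torch.stack((coords_x, coords_y), dim=-1)                # [B, H, W, 2]
        return F.grid_sample(image, grid, mode="bilinear", padding_mode="border", align_corners=True)

    img = torch.rand(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)
    flow[:, 0] = 3.0                                                    # sample 3 px to the right
    print(torch.allclose(warp_with_flow(img, flow)[..., :60], img[..., 3:63], atol=1e-5))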

Wed 22 Oct. 14:15 - 16:15 PDT

#422
Dense Policy: Bidirectional Autoregressive Learning of Actions

Yue Su · Xinyu Zhan · Hongjie Fang · Han Xue · Hao-Shu Fang · Yong-Lu Li · Cewu Lu · Lixin Yang

Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication.
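
A minimal sketch of the coarse-to-fine unfolding idea, assuming simple nearest-neighbour doubling of the action sequence with a placeholder refinement step, so a length-N trajectory is produced in roughly log2(N) iterations; the identity refiner stands in for the encoder-only model and is purely illustrative.

    # Illustrative sketch (assumed scheme): unfold an action sequence coarse-to-fine by
    # repeated doubling, so a length-N trajectory needs only about log2(N) prediction steps.
    import math
    import torch

    def dense_unfold(initial_action, target_len, refine):
        # initial_action: [1, D] single coarse action; refine: callable([L, D]) -> [L, D]
        seq = initial_action
        while seq.shape[0] < target_len:
            seq = seq.repeat_interleave(2, dim=0)[:target_len]   # double temporal resolution
            seq = refine(seq)                                    # stand-in for the encoder pass
        return seq

    refine = lambda x: x                                         # identity placeholder refiner
    traj = dense_unfold(torch.zeros(1, 7), target_len=64, refine=refine)
    print(traj.shape, math.ceil(math.log2(64)), "doubling steps")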

Wed 22 Oct. 14:15 - 16:15 PDT

#423
PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

Clinton A Mo · Kun Hu · Chengjiang Long · Dong Yuan · Wan-Chi Siu · Zhiyong Wang

The motion skeleton is a core data structure of 3D animation workflows, producing character motions by posing a pre-defined bone hierarchy. Motion data is largely incompatible across skeletons with proportional and/or hierarchical differences, raising long-standing challenges for data-driven motion synthesis. To address this, Temporal Point Clouds (TPC) have emerged as a universal, cross-compatible motion representation, using temporally consistent points that map motion trajectories. While TPCs have demonstrated reversibility with skeletal motions, their role is currently limited to enabling cross-compatibility, whereas we believe motion tasks can be learned directly in the TPC medium. This would require TPC motion synthesis capabilities, which is an unexplored field due to its unique temporal consistency and point identity requirements. In this paper, we propose PUMPS, the primordial auto-encoder architecture for TPC data. It reduces point cloud frames independently into sampleable feature vectors, from which a decoder efficiently extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process without requiring expensive point-wise attention mechanisms in the architecture. Using the auto-encoder, we produce a pre-trained motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation tasks. PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance in its pre-training tasks, and outperforming existing methods when fine-tuned for skeletal motion denoising and estimation tasks.
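
A minimal sketch of linear assignment-based point pairing, assuming an L2 cost between predicted and ground-truth points and the Hungarian solver from SciPy; the choice of cost and loss is an assumption.

    # Illustrative sketch: pair predicted and ground-truth points with linear assignment
    # (Hungarian matching) before computing a reconstruction loss; the L2 cost is an assumption.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def paired_reconstruction_loss(pred, gt):
        # pred, gt: [N, 3] point sets for one frame
        cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # [N, N] pairwise distances
        rows, cols = linear_sum_assignment(cost)                            # optimal one-to-one pairing
        return cost[rows, cols].mean(), cols

    pred = np.random.rand(128, 3)
    gt = np.random.rand(128, 3)
    loss, matching = paired_reconstruction_loss(pred, gt)
    print(loss, matching[:5])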

Wed 22 Oct. 14:15 - 16:15 PDT

#424
Highlight
Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration

Baoyou Chen · Ce Liu · Weihao Yuan · Zilong Dong · Siyu Zhu

Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts.
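
A minimal sketch of replacing a hard codebook lookup with Dirichlet-weighted mixing of codewords, assuming per-position concentration parameters predicted elsewhere; shapes, codebook size, and concentration values are all assumptions, not the paper's exact parameterization.

    # Illustrative sketch (assumed shapes): soften a VQ lookup into a Dirichlet-weighted
    # convex combination of codebook entries, giving a continuous probabilistic latent.
    import torch

    def dirichlet_codebook_mix(concentration, codebook):
        # concentration: [N, K] positive Dirichlet parameters per latent position
        # codebook:      [K, D] pretrained VQ-VAE codewords
        weights = torch.distributions.Dirichlet(concentration).rsample()   # [N, K], rows sum to 1
        return weights @ codebook                                          # [N, D] soft codewords

    codebook = torch.randn(1024, 256)
    alpha = torch.rand(16, 1024) + 0.1            # keep parameters strictly positive
    z = dirichlet_codebook_mix(alpha, codebook)
    print(z.shape)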

Wed 22 Oct. 14:15 - 16:15 PDT

#425
Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

Juan Hu · Shaojing Fan · Terence Sim

Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios, due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce \textsf{HICOM}, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that \textsf{HICOM} improves average accuracy by 3.3\% in in-dataset detection and 2.8\% under real-world perturbations. Moreover, it outperforms existing methods by 5.8\% on unseen datasets, demonstrating the generalization of human-inspired cues. \textsf{HICOM} further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on involving human factors to enhance defense against deepfakes.

Wed 22 Oct. 14:15 - 16:15 PDT

#426
MistSense: Versatile Online Detection of Procedural and Execution Mistakes

Constantin Patsch · Yuankai Wu · Marsil Zakour · Driton Salihu · Eckehard Steinbach

Online mistake detection is crucial across various domains, ranging from industrial automation to educational applications, since continuous inference on a video stream allows the human operator to correct mistakes soon after they are detected. While prior research mainly addresses procedural errors that often relate to temporal and ordering information, identifying a broader range of error types is essential for real-world implementation. In this work, we present MistSense, an approach for online mistake identification that offers this versatility by considering both procedural errors, which involve incorrect action sequences, and execution errors, such as motor inaccuracies or improper equipment use. Our method integrates RGB and hand pose features to capture fine-grained contextual cues in order to detect a mistake. By jointly modeling spatial and sequential aspects of human actions, our framework enables robust and adaptive error detection in dynamic environments. Once a mistake has been detected, we leverage a large language model (LLM) which provides an error explanation that gives the user further insights into why an action has been identified as a mistake. The evaluation on common mistake detection benchmarks shows the effectiveness of our approach.

Wed 22 Oct. 14:15 - 16:15 PDT

#427
SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting

Arthur Josi · Luiz Gustavo Hafemann · Abdallah Dib · Emeline Got · Rafael M. O. Cruz · Marc-André Carbonneau

Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. We start by learning an expression representation from high quality 3D data of unpaired facial expressions. Then, we train a model to predict expression from monocular images relying on a novel semi-supervised scheme using low quality synthetic data. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to new identities.

Wed 22 Oct. 14:15 - 16:15 PDT

#428
Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions

Mengyu Yang · Yiming Chen · Haozheng Pei · Siddhant Agarwal · Arun Vasudevan · James Hays

Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly responsible. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. Our model enforces object-awareness by using a slot attention visual encoder. We then develop an automatic method to compute segmentation masks of the objects involved to guide the model's focus towards the most informative regions of the interaction. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.

Wed 22 Oct. 14:15 - 16:15 PDT

#429
LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables

Xunpeng Yi · yibing zhang · Xinyu Xiang · Qinglong Yan · Han Xu · Jiayi Ma

Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability to real-time fusion devices. In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. Firstly, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization-based LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code will be made publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#430
LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

Yisu Zhang · Chenjie Cao · Chaohui Yu · Jianke Zhu

Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects, with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data.
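
A minimal sketch of norm-consistent LoRA fusion, assuming two already-formed weight deltas are rescaled to a common norm and combined with scalar gains; this only illustrates the linear-scalability and norm-consistency principles, not the full LiON-LoRA method, and the target-norm choice is an assumption.

    # Illustrative sketch (not the full LiON-LoRA method): fuse two LoRA weight deltas after
    # normalising their magnitudes, then scale the result linearly to control motion strength.
    import torch

    def fuse_lora_deltas(delta_a, delta_b, gain_a=1.0, gain_b=1.0):
        # delta = B @ A for each adapter; here we take the already-formed weight updates
        na, nb = delta_a.norm(), delta_b.norm()
        target = 0.5 * (na + nb)                               # enforce a common norm before mixing
        return gain_a * delta_a * (target / na) + gain_b * delta_b * (target / nb)

    d_cam = torch.randn(768, 768) * 0.01                       # e.g. a camera-trajectory LoRA update
    d_obj = torch.randn(768, 768) * 0.05                       # e.g. an object-motion LoRA update
    fused = fuse_lora_deltas(d_cam, d_obj, gain_a=1.0, gain_b=0.5)
    print(fused.norm().item())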

Wed 22 Oct. 14:15 - 16:15 PDT

#431
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

Zhuo Li · Mingshuang Luo · RuiBing Hou · XIN ZHAO · Hao Liu · Hong Chang · Zimo Liu · Chen Li

Human motion generation has been widely studied due to its crucial role in areas such as digital humans and humanoid robot control. However, many current motion generation approaches disregard physics constraints, frequently resulting in physically implausible motions with pronounced artifacts such as floating and foot sliding. Meanwhile, training an effective motion physics optimizer with noisy motion data remains largely unexplored. In this paper, we propose Morph, a Motion-Free physics optimization framework, consisting of a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on expensive real-world motion data. Specifically, the motion generator is responsible for providing large-scale synthetic, noisy motion data, while the motion physics refinement module utilizes these synthetic data to learn a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. Additionally, we introduce a prior reward module to enhance the stability of the physics optimization process and generate smoother and more stable motions. These physically refined motions are then used to fine-tune the motion generator, further enhancing its capability. This collaborative training paradigm enables mutual enhancement between the motion generator and the motion physics refinement module, significantly improving practicality and robustness in real-world applications. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion quality while improving physical plausibility drastically.

Wed 22 Oct. 14:15 - 16:15 PDT

#432
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

Kaidong Zhang · Rongtao Xu · Ren Pengzhen · Junfan Lin · Hefeng Wu · Liang Lin · Xiaodan Liang

Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

Wed 22 Oct. 14:15 - 16:15 PDT

#433
Highlight
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Jungbin Cho · Junwan Kim · Jisoo Kim · Minseo Kim · Mingu Kang · Sungeun Hong · Tae-Hyun Oh · Youngjae Yu

Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this discord between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals in diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Code and checkpoints will be released.
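
A minimal sketch of rectified-flow decoding, assuming a tiny stand-in velocity network and a few Euler steps from Gaussian noise to motion conditioned on token features; the dimensions, conditioning format, and step count are assumptions.

    # Illustrative sketch: decode conditioning token features into continuous motion by
    # Euler-integrating a rectified-flow velocity field from noise (t=0) to data (t=1).
    import torch
    import torch.nn as nn

    class VelocityField(nn.Module):
        def __init__(self, motion_dim=66, token_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(motion_dim + token_dim + 1, 128),
                                     nn.SiLU(), nn.Linear(128, motion_dim))
        def forward(self, x, tokens, t):
            return self.net(torch.cat([x, tokens, t.expand(x.shape[0], 1)], dim=-1))

    @torch.no_grad()
    def decode(tokens, model, steps=8):
        x = torch.randn(tokens.shape[0], 66)                   # start from Gaussian noise
        for i in range(steps):
            t = torch.full((1, 1), i / steps)
            x = x + model(x, tokens, t) / steps                # Euler step along the flow
        return x

    model = VelocityField()
    motion = decode(torch.randn(16, 32), model)                # 16 frames of 22x3 joint values
    print(motion.shape)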

Wed 22 Oct. 14:15 - 16:15 PDT

#434
MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation

Syed Talal Wasim · Hamid Suleman · Olga Zatsarynna · Muzammal Naseer · Juergen Gall

We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. While recent State Space Models (SSMs) like Mamba have shown promise through input-dependent selectivity on three key parameters, the critical forget-gate ($\textbf{A}$ matrix) controlling temporal memory remains static. We address this limitation by introducing a mixture of experts approach that dynamically selects contextually relevant $\textbf{A}$ matrices based on input features, enhancing representational capacity without sacrificing computational efficiency. Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets demonstrate that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Our results highlight the importance of input-dependent forget-gate mechanisms for reliable prediction of human behavior in diverse real-world scenarios.
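
A minimal sketch of an input-dependent mixture over a bank of (diagonal) state matrices, assuming a linear router with softmax weights; this only illustrates the mixture-of-A idea and is far simpler than an actual Mamba block or the paper's architecture.

    # Illustrative sketch (greatly simplified, diagonal state matrices): select an
    # input-dependent forget gate A as a soft mixture over a bank of expert A matrices.
    import torch
    import torch.nn as nn

    class MixtureForgetGate(nn.Module):
        def __init__(self, d_state=16, num_experts=4, d_in=32):
            super().__init__()
            self.experts = nn.Parameter(torch.rand(num_experts, d_state))   # bank of diagonal A's
            self.router = nn.Linear(d_in, num_experts)
        def forward(self, x):
            # x: [B, T, d_in]; returns per-step diagonal forget gates [B, T, d_state]
            w = self.router(x).softmax(dim=-1)                 # input-dependent expert weights
            return torch.einsum("btk,kd->btd", w, self.experts)

    gate = MixtureForgetGate()
    A_t = gate(torch.randn(2, 50, 32))
    print(A_t.shape)                                           # torch.Size([2, 50, 16])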

Wed 22 Oct. 14:15 - 16:15 PDT

#435
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin · Jeongsoo Choi · Puyuan Peng · Joon Chung Chung · Tae-Hyun Oh · David Harwath

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

Wed 22 Oct. 14:15 - 16:15 PDT

#436
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding

Thomas Kreutz · Max Mühlhäuser · Alejandro Sanchez Guinea

Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a \underline{\textbf{D}e}ep \underline{\textbf{S}}keleton-\underline{\textbf{P}}ointcloud-\underline{\textbf{I}}MU-\underline{\textbf{T}}ext \underline{\textbf{E}}mbedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton$\leftrightarrow$Pointcloud$\leftrightarrow$IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
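
A minimal sketch of the contrastive objective underlying such joint embedding spaces: a symmetric InfoNCE (noise contrastive estimation) loss between two modality embeddings of the same clips, with the temperature, batch pairing, and dimensions as assumptions.

    # Illustrative sketch: symmetric InfoNCE between two modality embeddings (e.g. point-cloud
    # and skeleton features of the same clips), the contrastive objective such models build on.
    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.07):
        # z_a, z_b: [B, D] embeddings of the same B clips from two modalities
        z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / temperature                   # [B, B] similarity matrix
        targets = torch.arange(z_a.shape[0])                   # matching pairs sit on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
    print(loss.item())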

Wed 22 Oct. 14:15 - 16:15 PDT

#437
Temporal Rate Reduction Clustering for Human Motion Segmentation

Xianghan Meng · Zhengyu Tong · Zhiyuan Huang · Chun-Guang Li

Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in videos capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ maintain temporal consistency and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.
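
For reference, rate-reduction objectives build on the coding-rate quantity R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for a feature matrix Z of shape [d, n]; the sketch below computes this generic quantity and is not necessarily the paper's exact training loss.

    # Illustrative sketch: the coding-rate function used by rate-reduction objectives,
    # R(Z) = 0.5 * logdet(I + d / (n * eps^2) * Z Z^T), for features Z of shape [d, n].
    import torch

    def coding_rate(Z, eps=0.5):
        d, n = Z.shape
        gram = Z @ Z.t()                                       # [d, d]
        return 0.5 * torch.logdet(torch.eye(d) + (d / (n * eps ** 2)) * gram)

    Z = torch.randn(64, 200)                                   # 200 frame features of dimension 64
    print(coding_rate(Z).item())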

Wed 22 Oct. 14:15 - 16:15 PDT

#438
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

Shengdong Han · Shangdong Yang · Yuxuan Li · Xin Zhang · Xiang Li · jian Yang · Ming-Ming Cheng · Yimian Dai

Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models, which will be publicly available soon.
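
A minimal sketch of the classical iterative shrinkage-thresholding (ISTA) update that such networks unroll, assuming a fixed linear measurement operator, step size, and threshold where DISTA-Net would generate these dynamically; the synthetic problem setup is purely illustrative.

    # Illustrative sketch: plain ISTA for y = A x with sparse x, the iteration that
    # DISTA-Net unrolls; here A, the step size and the threshold are fixed rather than
    # predicted dynamically by a network.
    import numpy as np

    def soft_threshold(v, tau):
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def ista(y, A, num_iters=100, tau=0.05):
        step = 1.0 / np.linalg.norm(A, 2) ** 2                 # 1 / Lipschitz constant of A^T A
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            x = soft_threshold(x + step * A.T @ (y - A @ x), tau * step)
        return x

    rng = np.random.default_rng(0)
    A = rng.normal(size=(64, 256))                             # measurement / blur operator
    x_true = np.zeros(256)
    x_true[rng.choice(256, 5, replace=False)] = rng.normal(size=5) + 3
    y = A @ x_true + 0.01 * rng.normal(size=64)
    x_hat = ista(y, A)
    print(np.argsort(-np.abs(x_hat))[:5], np.flatnonzero(x_true))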

Wed 22 Oct. 14:15 - 16:15 PDT

#439
Efficient Concertormer for Image Deblurring and Beyond

Pin-Hung Kuo · Jinshan Pan · Shao-Yi Chien · Ming-Hsuan Yang

The Transformer architecture has excelled in NLP and vision tasks, but its self-attention complexity grows quadratically with image size, making high-resolution tasks computationally expensive. We introduce Concertormer, featuring Concerto Self-Attention (CSA) for image deblurring. CSA splits self-attention into global and local components while retaining partial information in additional dimensions, achieving linear complexity. A Cross-Dimensional Communication module enhances expressiveness by linearly combining attention maps. Additionally, our gated-dconv MLP merges the two-stage Transformer design into a single stage. Extensive evaluations show our method performs favorably against state-of-the-art works in deblurring, deraining, and JPEG artifact removal. Code and models will be publicly available.

We propose ASCENT, a novel framework for tracking neurons in 3D fluorescence microscopy recordings without relying on manual track annotations. ASCENT leverages self-supervised contrastive learning to learn robust, discriminative embeddings from detected neuron candidates. At its core is a volume compression module that transforms full 3D volumetric data into an efficient 2D representation by iteratively projecting along the z-axis and integrating positional information. This compressed representation is processed by a deep encoder (e.g., ResNet or Vision Transformer) to yield robust feature vectors that capture both appearance and spatial relationships among neurons. Extensive experiments on both in-house and public datasets demonstrate that ASCENT achieves state-of-the-art tracking performance with fast inference speed while removing the need for costly manual labeling and heavy pre- and post-processing. Our results suggest that this approach provides a scalable solution for 3D neuron tracking and holds promise for applications such as inter-individual neuron identity matching and demixing overlapping cells.

Wed 22 Oct. 14:15 - 16:15 PDT

#441
InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Wenjie Zhuo · Fan Ma · Hehe Fan

We introduce InfiniDreamer, a novel framework for arbitrarily long human motion generation. Existing motion generation methods are often constrained to short sequences due to the lack of long motion training data. To overcome this, InfiniDreamer first generates sub-motions corresponding to each textual description and assembles them into a coarse long sequence using randomly initialized transition segments. To refine the entire motion, we propose Segment Score Distillation (SSD)—an optimization-based method that leverages a motion prior trained solely on short clips, enabling long-sequence generation without additional training. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.

Wed 22 Oct. 14:15 - 16:15 PDT

#442
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki · Dongchan Min · Gyeongsu Chae

With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

Wed 22 Oct. 14:15 - 16:15 PDT

#443
VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Phu Tran Dinh · Hung Dao · Daeyoung Kim

Video super-resolution (VSR) remains a significant challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency-domain gap between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
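
A minimal sketch of a frequency-domain Charbonnier-style penalty, assuming the loss is taken on the complex difference of 2D FFTs of the reconstructed and ground-truth frames; the exact formulation and weighting used in the paper may differ.

    # Illustrative sketch (assumed form): a Charbonnier-style penalty applied in the
    # frequency domain to encourage matching high-frequency content.
    import torch

    def frequency_charbonnier(pred, target, eps=1e-3):
        # pred, target: [B, C, H, W]
        diff = torch.fft.fft2(pred) - torch.fft.fft2(target)   # difference of complex spectra
        return torch.sqrt(diff.real ** 2 + diff.imag ** 2 + eps ** 2).mean()

    loss = frequency_charbonnier(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
    print(loss.item())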

Wed 22 Oct. 14:15 - 16:15 PDT

#444
Face Retouching with Diffusion Data Generation and Spectral Restorement

Zhidan Xu · Xiaoqin Zhang · Shijian Lu

Face retouching has achieved impressive performance, largely driven by its wide range of applications in various real-world tasks. However, most existing works encounter a dilemma between global consistency and local detail preservation, partially due to the lack of large-scale and high-quality training data. We address the face retouching challenge from two perspectives. First, we create a large-scale face retouching benchmark to mitigate the data scarcity issue. The benchmark comprises 25,000 pairs of high-quality facial images (before and after face retouching) that contain a variety of facial attributes and blemish types such as acne and moles. Second, we design a novel framework that introduces frequency selection and restoration (FSR) and multi-resolution fusion (MRF), leveraging frequency-aware dynamic aggregation and spatial-frequency filtering to achieve global consistency and local detail preservation concurrently. Inspired by the principle of JPEG compression, FSR introduces frequency-domain quantization with spatial projections to learn enhanced feature representations. MRF fuses multi-resolution features via Laplacian pyramid fusion, removing large-area blemishes and preserving local fine details effectively. Extensive experiments over multiple benchmarks show that the proposed framework outperforms the state-of-the-art quantitatively and qualitatively. The created benchmark also provides valuable data for training and evaluating both existing and future face retouching networks.

Wed 22 Oct. 14:15 - 16:15 PDT

#445
Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring

Yufei Zhu · Hao Chen · Yongjian Deng · Wei You

Traditional motion deblurring methods struggle to effectively model motion information within the exposure time. Recently, event cameras have attracted significant research interest for their ability to model motion cues over the exposure duration. However, existing event-based methods directly fuse event features with image features, overlooking the intrinsic heterogeneity of events. In this paper, we identify that the event modality contains two conflicting types of information: edge features and motion cues. Events accumulated over a short exposure period capture sharp edge details but lose motion information, while those accumulated over a long exposure period blur edge details due to motion. To address this issue, we propose a simple yet effective approach to disentangle these two cues from event features and employ an edge-aware sharpening module along with a motion-driven scale-adaptive deblurring module to fully leverage both. Specifically, the first module aids in restoring sharp edges by leveraging the clear edge features provided by events, while the second module leverages motion cues to learn diverse blur kernels, adaptively adjusting the receptive field for optimal deblurring. Extensive experiments on synthetic and real-world datasets validate the effectiveness of our approach, which yields a substantial improvement over state-of-the-art single-frame methods and surpasses most multi-frame-based methods. Code will be publicly available.

Wed 22 Oct. 14:15 - 16:15 PDT

#446
2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

Marvin Heidinger · Snehal Jauhri · Vignesh Prasad · Georgia Chalvatzaki

When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.

Wed 22 Oct. 14:15 - 16:15 PDT

#447
Flow Stochastic Segmentation Networks

Fabio De Sousa Ribeiro · Omar Todd · Charles Jones · Avinash Kori · Raghav Mehta · Ben Glocker

We propose the Flow Stochastic Segmentation Network (Flow-SSN), a generative model for probabilistic segmentation featuring discrete-time autoregressive and modern continuous-time flow parameterisations. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, as most of the model capacity is allocated to learning the base distribution of the flow, which constitutes an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results.
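
To illustrate the general recipe of sampling a segmentation from a continuous-time flow (not the Flow-SSN architecture itself), the toy model below draws per-pixel logits from an image-conditioned Gaussian base and refines them with a few Euler steps of a learned velocity field; all network sizes, the step count, and the conditioning scheme are assumptions for illustration.

```python
# Toy continuous-time flow sampler over segmentation logits (illustrative sketch).
import torch
import torch.nn as nn

class TinyFlowSegmenter(nn.Module):
    def __init__(self, in_ch=3, classes=2, hidden=32):
        super().__init__()
        # Base distribution: per-pixel mean and log-std of the logits.
        self.base = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * classes, 3, padding=1),
        )
        # Velocity field conditioned on current logits, image, and time.
        self.vel = nn.Sequential(
            nn.Conv2d(classes + in_ch + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, classes, 3, padding=1),
        )

    @torch.no_grad()
    def sample(self, img, steps=8):
        mean, log_std = self.base(img).chunk(2, dim=1)
        x = mean + log_std.exp() * torch.randn_like(mean)    # sample from the base
        dt = 1.0 / steps
        for i in range(steps):                                # Euler integration of the flow
            t = torch.full_like(x[:, :1], i * dt)
            x = x + dt * self.vel(torch.cat([x, img, t], dim=1))
        return x.argmax(dim=1)                                # one plausible segmentation

if __name__ == "__main__":
    model = TinyFlowSegmenter()
    masks = model.sample(torch.rand(2, 3, 64, 64))
    print(masks.shape)  # torch.Size([2, 64, 64])
```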

Wed 22 Oct. 14:15 - 16:15 PDT

#448

Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, making them easily noticeable to human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of the patch location to both the victim model's prediction and the human visual system, and then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy to optimize invisible perturbations. Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.
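
As a rough sketch of the two ingredients described above (sensitivity-guided placement and perceptibility-regularized optimization), the toy attack below scores candidate locations with an input-gradient saliency map and then optimizes a targeted loss plus an L2 visibility penalty restricted to the chosen patch region. The saliency heuristic, penalty, and optimizer are stand-ins chosen for illustration, not the IAP algorithm.

```python
# Toy targeted patch attack with saliency-based placement and a visibility penalty.
import torch
import torch.nn.functional as F

def targeted_low_visibility_patch(model, img, target, patch=32, steps=200, lam=10.0, lr=0.01):
    """img: (1, C, H, W) tensor in [0, 1]; target: int class index for the targeted attack."""
    model.eval()
    # 1) Crude placement: gradient magnitude of the target logit w.r.t. the input,
    #    average-pooled to score every candidate patch location.
    x = img.clone().requires_grad_(True)
    model(x)[0, target].backward()
    sal = x.grad.abs().sum(dim=1, keepdim=True)          # (1, 1, H, W)
    heat = F.avg_pool2d(sal, patch, stride=1)            # (1, 1, H-p+1, W-p+1)
    idx = int(heat.view(-1).argmax())
    top, left = idx // heat.shape[-1], idx % heat.shape[-1]
    # 2) Optimise a perturbation in that region, penalising its visible magnitude.
    delta = torch.zeros(1, img.shape[1], patch, patch, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    H, W = img.shape[-2:]
    for _ in range(steps):
        pad = (left, W - left - patch, top, H - top - patch)
        adv = (img + F.pad(delta, pad)).clamp(0, 1)
        loss = F.cross_entropy(model(adv), torch.tensor([target])) + lam * delta.pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return adv.detach(), (top, left)
```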

Wed 22 Oct. 14:15 - 16:15 PDT

#449
Towards a Unified Copernicus Foundation Model for Earth Vision

Yi Wang · Zhitong Xiong · Chenying Liu · Adam Stewart · Thomas Dujardin · Nikolaos Ioannis Bountos · Angelos Zavras · Franziska Gerken · Ioannis Papoutsis · Laura Leal-Taixé · Xiao Xiang Zhu

Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.
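
To give a flavour of the dynamic-hypernetwork idea mentioned above, the sketch below generates patch-embedding weights per spectral band from simple metadata (a central wavelength), so the same backbone can ingest sensors with arbitrary channel sets. Layer sizes and the wavelength encoding are assumptions for illustration, not the Copernicus-FM design.

```python
# Toy hypernetwork-based patch embedding conditioned on per-band metadata.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPatchEmbed(nn.Module):
    def __init__(self, patch=16, dim=256, meta_dim=64):
        super().__init__()
        self.patch, self.dim = patch, dim
        self.meta_enc = nn.Sequential(nn.Linear(1, meta_dim), nn.GELU())
        self.hyper = nn.Linear(meta_dim, dim * patch * patch)  # generates projection weights

    def forward(self, x, wavelengths):
        """x: (B, C, H, W) image; wavelengths: (C,) central wavelength per band (in µm)."""
        C = x.shape[1]
        w = self.hyper(self.meta_enc(wavelengths.view(C, 1)))  # (C, dim * patch * patch)
        w = w.view(C, self.dim, self.patch, self.patch)
        tokens = 0
        for c in range(C):  # per-band patch projection, summed into shared tokens
            tokens = tokens + F.conv2d(x[:, c:c + 1], w[c].unsqueeze(1), stride=self.patch)
        return tokens.flatten(2).transpose(1, 2)               # (B, num_patches, dim)

if __name__ == "__main__":
    embed = DynamicPatchEmbed()
    imgs = torch.rand(2, 4, 64, 64)                  # a hypothetical 4-band sensor
    wl = torch.tensor([0.49, 0.56, 0.665, 0.842])    # hypothetical band centres (µm)
    print(embed(imgs, wl).shape)                     # torch.Size([2, 16, 256])
```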

Wed 22 Oct. 14:15 - 16:15 PDT

#450
Teeth Reconstruction and Performance Capture Using a Phone Camera

Weixi Zheng · Jingwang Ling · Zhibo Wang · Quan Wang · Feng Xu

We present the first method for personalized dental shape reconstruction and teeth-inclusive facial performance capture using only a single phone camera. Our approach democratizes high-quality facial avatars through a non-invasive, low-cost setup by addressing the ill-posed monocular capture problem with an analysis-by-synthesis approach. We introduce a representation adaptation technique that maintains both mesh and SDF representations of teeth, enabling efficient differentiable rendering while preventing teeth-lip interpenetration. To overcome alignment challenges with similar-appearing dental components, we leverage foundation models for semantic teeth segmentation and design specialized optimization objectives. Our method addresses the challenging occlusions of teeth during facial performance through optimization strategies that leverage facial structural priors, while our semantic mask rendering loss with optimal transport-based matching ensures convergence despite significant variations in initial positioning. Code will be released.
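
As a toy illustration of optimal-transport-based mask matching (one ingredient mentioned above), the function below treats the pixels of a rendered and a target semantic mask as 2D point sets and computes an entropic OT (Sinkhorn) matching cost between them. Coordinate normalization, the entropic regularizer, and the iteration count are assumptions; a real pipeline would additionally require the rendered mask to be produced differentiably.

```python
# Entropic OT (Sinkhorn) matching cost between two binary semantic masks (illustrative).
import torch

def sinkhorn_mask_cost(pred_mask, gt_mask, eps=0.1, iters=100):
    """pred_mask, gt_mask: (H, W) bool masks of one semantic region (e.g., one tooth)."""
    scale = float(max(pred_mask.shape))
    p = pred_mask.nonzero().float() / scale      # (N, 2) normalised pixel coordinates
    g = gt_mask.nonzero().float() / scale        # (M, 2)
    if len(p) == 0 or len(g) == 0:
        return torch.tensor(0.0)
    cost = torch.cdist(p, g)                     # (N, M) Euclidean distances
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    a = torch.full((len(p),), 1.0 / len(p))      # uniform weights on each point set
    b = torch.full((len(g),), 1.0 / len(g))
    v = torch.ones_like(b)
    for _ in range(iters):                       # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)   # approximate transport plan
    return (plan * cost).sum()                   # entropic OT matching cost

if __name__ == "__main__":
    m1 = torch.zeros(64, 64, dtype=torch.bool); m1[10:20, 10:20] = True
    m2 = torch.zeros(64, 64, dtype=torch.bool); m2[14:24, 12:22] = True
    print(float(sinkhorn_mask_cost(m1, m2)))     # small positive matching cost
```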