

Timezone: Pacific/Honolulu

Registration Desk: Registration/Badge Pickup Wed 22 Oct 07:30 a.m.  


Oral 3A: Foundation models and representation learning Wed 22 Oct 08:00 a.m.  

Oral
Huiyang Hu · Peijin Wang · Hanbo Bi · Boyuan Tong · Zhaozhi Wang · Wenhui Diao · Hao Chang · Yingchao Feng · Ziqi Zhang · Yaowei Wang · Qixiang Ye · Kun Fu · Xian Sun

[ Exhibit Hall III ]

Abstract
Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these challenges, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84%, FLOPs by 24%, and improve throughput by 2.7 times. The code will be made publicly available.
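The heat-conduction idea can be sketched compactly: diffusing a feature map for time t is equivalent to damping its 2D cosine-transform coefficients, the closed-form solution of the heat equation under Neumann boundaries. The snippet below is a minimal, hedged illustration of such an operator; the paper's actual HCO, its $O(N^{1.5})$ implementation, and the structure-guided conduction are not reproduced here, and all shapes and parameters are assumptions.

```python
# Minimal sketch of a heat-conduction-style operator (assumption: the DCT-based
# closed-form solution of the 2D heat equation; not the paper's exact HCO).
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduction_operator(feat: np.ndarray, k: float = 1.0, t: float = 1.0) -> np.ndarray:
    """Diffuse a (C, H, W) feature map for time t with conductivity k."""
    C, H, W = feat.shape
    # Frequencies of the 2D cosine basis (Neumann boundary conditions).
    wy = np.pi * np.arange(H) / H
    wx = np.pi * np.arange(W) / W
    decay = np.exp(-k * t * (wy[:, None] ** 2 + wx[None, :] ** 2))  # (H, W)
    out = np.empty_like(feat)
    for c in range(C):
        coeffs = dctn(feat[c], norm="ortho")          # to frequency domain
        out[c] = idctn(coeffs * decay, norm="ortho")  # damp high frequencies, invert
    return out

demo = heat_conduction_operator(np.random.rand(4, 64, 64).astype(np.float32))
```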
Oral
Yi Wang · Zhitong Xiong · Chenying Liu · Adam Stewart · Thomas Dujardin · Nikolaos Ioannis Bountos · Angelos Zavras · Franziska Gerken · Ioannis Papoutsis · Laura Leal-Taixé · Xiao Xiang Zhu

[ Exhibit Hall III ]

Abstract
Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.
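One common way to process "any spectral band" is to generate the patch-embedding kernels from band metadata (for example, central wavelength) with a small hypernetwork. The sketch below illustrates that general idea under assumed shapes and layer sizes; it is not Copernicus-FM's actual dynamic-hypernetwork design.

```python
# Sketch: a hypernetwork that generates per-band patch-embedding kernels from the
# band's central wavelength (assumed metadata); not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralHyperPatchEmbed(nn.Module):
    def __init__(self, embed_dim=256, patch=16):
        super().__init__()
        self.patch, self.embed_dim = patch, embed_dim
        # Maps a scalar wavelength (in micrometres) to one patch kernel per band.
        self.hyper = nn.Sequential(
            nn.Linear(1, 128), nn.GELU(), nn.Linear(128, embed_dim * patch * patch)
        )

    def forward(self, x, wavelengths):
        # x: (B, C, H, W) image with C arbitrary bands; wavelengths: (C,) tensor.
        B, C, H, W = x.shape
        w = self.hyper(wavelengths.view(C, 1))                 # (C, D*p*p)
        w = w.view(C, self.embed_dim, self.patch, self.patch)  # per-band kernels
        kernels = w.permute(1, 0, 2, 3)                        # (D, C, p, p)
        tokens = F.conv2d(x, kernels, stride=self.patch)       # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)               # (B, N, D)

embed = SpectralHyperPatchEmbed()
img = torch.randn(2, 5, 64, 64)                 # five arbitrary spectral bands
wl = torch.tensor([0.49, 0.56, 0.66, 0.86, 1.6])
print(embed(img, wl).shape)                     # torch.Size([2, 16, 256])
```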
Oral
Yibin Yan · Jilan Xu · Shangzhe Di · Yikun Liu · Yudi Shi · Qirui Chen · Zeqian Li · Yifei Huang · Weidi Xie

[ Exhibit Hall III ]

Abstract
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed as **StreamFormer**, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
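Causal temporal attention over a token stream can be realized with a block-causal mask so that the tokens of frame t attend only to frames up to t. A minimal sketch follows; the token layout and sizes are assumptions, not StreamFormer's exact design.

```python
# Sketch: block-causal attention over a stream of frame tokens, so that tokens of
# frame t attend only to frames <= t (assumed layout: time-major token order).
import torch
import torch.nn.functional as F

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # True where attention is *blocked* (future frames).
    return frame_ids[None, :] > frame_ids[:, None]

T, P, D = 4, 8, 64                     # frames, patches per frame, channels
x = torch.randn(2, T * P, D)           # a toy token stream
q = k = v = x
mask = block_causal_mask(T, P)         # (T*P, T*P) boolean mask of blocked positions
out = F.scaled_dot_product_attention(q, k, v, attn_mask=~mask)  # True = may attend
print(out.shape)                       # torch.Size([2, 32, 64])
```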
Oral
Haiwen Huang · Anpei Chen · Volodymyr Havrylov · Andreas Geiger · Dan Zhang

[ Exhibit Hall III ]

Abstract
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks.
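A coordinate-based cross-attention upsampler can be sketched as follows: high-resolution pixel coordinates (concatenated with image values) form the queries, and the low-resolution VFM features form the keys and values. The dimensions and layer choices below are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch: coordinate-based cross-attention upsampling of low-resolution VFM features.
import torch
import torch.nn as nn

class CoordCrossAttnUpsampler(nn.Module):
    def __init__(self, feat_dim=384, img_dim=3, d_model=256, heads=8):
        super().__init__()
        self.q_proj = nn.Linear(img_dim + 2, d_model)   # RGB + (x, y) coordinates
        self.kv_proj = nn.Linear(feat_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.out = nn.Linear(d_model, feat_dim)

    def forward(self, image, lr_feats):
        # image: (B, 3, H, W)   lr_feats: (B, C, h, w)
        B, _, H, W = image.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
        )
        coords = torch.stack([xs, ys], -1).view(1, H * W, 2).expand(B, -1, -1)
        q = self.q_proj(torch.cat([image.flatten(2).transpose(1, 2), coords.to(image)], -1))
        kv = self.kv_proj(lr_feats.flatten(2).transpose(1, 2))
        hr, _ = self.attn(q, kv, kv)                     # (B, H*W, d_model)
        return self.out(hr).transpose(1, 2).reshape(B, -1, H, W)

up = CoordCrossAttnUpsampler()
hr_feats = up(torch.randn(1, 3, 32, 32), torch.randn(1, 384, 8, 8))
print(hr_feats.shape)                                    # torch.Size([1, 384, 32, 32])
```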
Oral
Ziwei Wang · Sameera Ramasinghe · Chenchen Xu · Julien Monteil · Loris Bazzani · Thalaiyasingam Ajanthan

[ Exhibit Hall III ]

Abstract
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
Oral
Tingting Zheng · Hongxun Yao · Kui Jiang · Yi Xiao · Sicheng Zhao

[ Exhibit Hall III ]

Abstract
Recent advances in selective state space models (Mamba) have shown great promise in whole slide image (WSI) classification. Despite this, WSIs contain explicit local redundancy (similar patches) and irrelevant regions (uninformative instances), posing significant challenges for Mamba-based multi-instance learning (MIL) methods in capturing global representations. Furthermore, bag-level approaches struggle to extract critical features from all instances, while group-level methods fail to adequately account for tumor dispersion and intrinsic correlations across groups, leading to suboptimal global representations. To address these issues, we propose group masking Mamba (GMMamba), a novel framework that combines two elaborate modules: (1) intra-group masking Mamba (IMM) for selective instance exploration within groups, and (2) cross-group super-feature sampling (CSS) to ameliorate long-range relation learning. Specifically, IMM adaptively predicts sparse masks to filter out features with low attention scores (i.e., uninformative patterns) during bidirectional Mamba modeling, facilitating the removal of instance redundancies for compact local representation. For improved bag prediction, the CSS module further aggregates sparse group representations into discriminative features, effectively grasping comprehensive dependencies among dispersed and sparse tumor regions inherent in large-scale WSIs. Extensive experiments on four datasets demonstrate that GMMamba outperforms the state-of-the-art ACMIL by 2.2% and 6.4% in accuracy on the TCGA-BRCA and TCGA-ESCA datasets, …

Oral 3B: Human Modeling Wed 22 Oct 08:00 a.m.  

Oral
Tianyi Wang · Shuaicheng Niu · Harry Cheng · xiao zhang · Yinglong Wang

[ Kalakaua Ballroom ]

Abstract
Because passive detection of high-quality Deepfake images suffers from performance bottlenecks due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue for the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, and a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various …
Oral
Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Korrawe Karunratanakul · Pu Wang · Hongfei Xue · Chen Chen · chuan guo · Junli Cao · Jian Ren · Sergey Tulyakov

[ Kalakaua Ballroom ]

Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution to force the generated motion to accurately align with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to combat the non-differentiable distribution sampling process encountered by the logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at \url{https://anonymous-ai-agent.github.io/CAM}
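The non-differentiability mentioned above comes from sampling discrete motion tokens; a standard workaround, and presumably the spirit of DES, is to take a probability-weighted expectation over the codebook so gradients can flow through the token distribution. A hedged sketch (the paper's exact DES formulation may differ):

```python
# Sketch: "differentiable expectation sampling" as a softmax-weighted average of
# codebook embeddings, so gradients flow through the token distribution.
import torch

def differentiable_expectation_sample(logits, codebook, temperature=1.0):
    # logits: (B, T, K) over K motion tokens; codebook: (K, D) token embeddings.
    probs = torch.softmax(logits / temperature, dim=-1)   # (B, T, K)
    return probs @ codebook                               # (B, T, D) expected embedding

logits = torch.randn(2, 16, 512, requires_grad=True)
codebook = torch.randn(512, 64)
motion_latent = differentiable_expectation_sample(logits, codebook)
motion_latent.sum().backward()        # gradients reach the logits
print(logits.grad.shape)              # torch.Size([2, 16, 512])
```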
Oral
Byungjun Kim · Shunsuke Saito · Giljoo Nam · Tomas Simon · Jason Saragih · Hanbyul Joo · Junxuan Li

[ Kalakaua Ballroom ]

Abstract
We present a universal prior model for 3D head avatars with hair compositionality. Existing approaches for building a generalizable prior for 3D head avatars often model face and hair in a monolithic manner, where the inherent compositionality of the human head and hair is not considered. It is especially challenging for the monolithic model to self-discover the compositionality of face and hair when the dataset is not large enough. Moreover, extending the monolithic model for applications like swapping faces or hairstyles in 3D is not straightforward. Our prior model explicitly accounts for the compositionality of face and hair, learning their priors separately. To learn disentangled latent spaces of face and hair for 3D head avatars, we propose a synthetic hairless data creation pipeline for dehairing the studio-captured dataset with estimated hairless geometry and hairless texture obtained from a diffusion prior. Using a paired dataset of hair and hairless captures, disentangled prior models for face and hair can be trained by leveraging compositionality as an inductive bias to achieve disentanglement. Our model's inherent compositionality enables a seamless transfer of face and hair components between avatars while maintaining the subject's identity. Furthermore, we demonstrate that our model can be finetuned with a monocular …
Oral
Sindhu Hegde · K R Prajwal · Taein Kwon · Andrew Zisserman

[ Kalakaua Ballroom ]

Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. All code, models, and data annotations will be released to support future research.
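A global phrase contrastive loss of this kind is typically a symmetric InfoNCE objective over paired embeddings. The sketch below shows that standard form for one pairing (gesture-video vs. phrase embeddings); the dimensions and temperature are assumptions, and the local gesture-word coupling loss is not shown.

```python
# Sketch: a symmetric InfoNCE contrastive loss between gesture-video and phrase
# (text or speech) embeddings -- the standard form such "global phrase" losses take.
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(video_emb, phrase_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)           # (B, D)
    p = F.normalize(phrase_emb, dim=-1)          # (B, D)
    logits = v @ p.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = phrase_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```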
Oral
Junzhe Lu · Jing Lin · Hongkun Dou · Ailing Zeng · Yue Deng · Xian Liu · Zhongang Cai · Lei Yang · YULUN ZHANG · Haoqian Wang · Ziwei Liu

[ Kalakaua Ballroom ]

Abstract
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
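Truncated timestep scheduling can be pictured as restricting the diffusion timesteps sampled during training to a sub-range better matched to low-dimensional pose data. A minimal sketch, with the truncation ratio as an assumed hyperparameter:

```python
# Sketch: truncated timestep sampling for diffusion training on pose data --
# timesteps are drawn only from [0, t_max) instead of the full [0, T).
# (Illustrative; the paper's exact schedule is not specified here.)
import torch

def sample_truncated_timesteps(batch_size, T=1000, truncation=0.4, device="cpu"):
    t_max = max(1, int(T * truncation))
    return torch.randint(0, t_max, (batch_size,), device=device)

t = sample_truncated_timesteps(32)     # e.g. timesteps in [0, 400)
```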
Oral
Weixi Zheng · Jingwang Ling · Zhibo Wang · Quan Wang · Feng Xu

[ Kalakaua Ballroom ]

Abstract
We present the first method for personalized dental shape reconstruction and teeth-inclusive facial performance capture using only a single phone camera. Our approach democratizes high-quality facial avatars through a non-invasive, low-cost setup by addressing the ill-posed monocular capture problem with an analysis-by-synthesis approach. We introduce a representation adaptation technique that maintains both mesh and SDF representations of teeth, enabling efficient differentiable rendering while preventing teeth-lip interpenetration. To overcome alignment challenges with similar-appearing dental components, we leverage foundation models for semantic teeth segmentation and design specialized optimization objectives. Our method addresses the challenging occlusions of teeth during facial performance through optimization strategies that leverage facial structural priors, while our semantic mask rendering loss with optimal transport-based matching ensures convergence despite significant variations in initial positioning. Code will be released.

Invited Talk: Brent Seales

On Perseverance: Virtually Unwrapping the Herculaneum Scrolls

This talk tells the story of virtual unwrapping, conceived during the rise of digital libraries, computer vision, and large-scale computing, and now realized on some of the most difficult and iconic material in the world - the Herculaneum Scrolls - as a result of the recent phenomena of big data and machine learning. Virtual unwrapping is a non-invasive restoration pathway for damaged written material, allowing texts to be read from objects that are too damaged even to be opened. The Herculaneum papyrus scrolls, buried and carbonized by the eruption of Mount Vesuvius in 79 CE and then excavated in the 18th century, are original, classical texts from the shelves of the only library to have survived from antiquity. The 250-year history of science and technology applied to the challenge of opening and then reading them has created a fragmentary, damaged window into their literary and philosophical secrets. In 1999, with more than 400 scrolls still unopened, methods for physical unwrapping were permanently halted. The intact scrolls present an enigmatic challenge: preserved by the fury of Vesuvius, yet still lost. Using a non-invasive imaging approach, we have now shown how to recover their texts, rendering them "unlost." The path we have forged uses high energy physics, artificial intelligence, and the collective power of a global, scientific community inspired by prizes, collaborative generosity, and the common goal of shared glory: reading original classical texts for the first time in 2000 years.

Brent Seales

 

Dr. W. Brent Seales is the Stanley and Karen Pigman Chair of Heritage Science and Professor of Computer Science at the University of Kentucky. He earned a Ph.D. in Computer Science at the University of Wisconsin-Madison and has held research positions at INRIA Sophia Antipolis, UNC Chapel Hill, Google (Paris), and the Getty Conservation Institute. The Heritage Science research lab (EduceLab) founded by Seales at the University of Kentucky applies techniques in machine learning and data science to the digital restoration of damaged materials. The research program is funded by the National Science Foundation, the National Endowment for the Humanities, the Arts and Humanities Research Council of Great Britain, the Andrew W. Mellon Foundation, and Google. Seales is a co-founder of the Vesuvius Challenge, an international contest formed around the goal of the virtual unwrapping of Herculaneum scrolls. He continues to work with challenging, damaged material (Herculaneum Scrolls, Dead Sea Scrolls), with notable successes in the scroll from En-Gedi (Leviticus), the Morgan MS M.910 (The Acts of the Apostles), and PHerc.Paris.3 and 4 (Philodemus / Epicureanism). The recovery of readable text from still-unopened material has been hailed worldwide as an astonishing achievement fueled by open scholarship, interdisciplinary collaboration, and extraordinary leadership generosity.



Poster Session 3 & Exhibit Hall Wed 22 Oct 10:45 a.m.  

Poster
Risa Shinoda · Nakamasa Inoue · Iro Laina · Christian Rupprecht · Hirokatsu Kataoka

[ Exhibit Hall I ]

Abstract
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. We will make the dataset publicly available for research purposes.
Poster
Shih-Po Lee · Ehsan Elhamifar

[ Exhibit Hall I ]

Abstract
Understanding user actions and their possible mistakes is essential for successful operation of task assistants. In this paper, we develop a unified framework for joint temporal action segmentation and error recognition (recognizing when and which type of error happens) in procedural task videos. We propose a Generalized Task Graph (GTG) whose nodes encode correct steps and background (task-irrelevant actions). We then develop a GTG-Video Alignment algorithm (GTG2Vid) to jointly segment videos into actions and detect frames containing errors. Given that it is infeasible to gather many videos and their annotations for different types of errors, we study a framework that only requires normal (error-free) videos during training. More specifically, we leverage large language models (LLMs) to obtain error descriptions and subsequently use video-language models (VLMs) to generate visually-aligned textual features, which we use for error recognition. We then propose an Error Recognition Module (ERM) to recognize the error frames predicted by GTG2Vid using the generated error features. Through extensive experiments on two egocentric datasets, EgoPER and CaptainCook4D, we show that our framework outperforms other baselines on action segmentation, error detection, and recognition.
Poster
Jiahui Lei · Kyle Genova · George Kopanas · Noah Snavely · Leonidas Guibas

[ Exhibit Hall I ]

Abstract
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap in 3D, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
Poster
Nan Chen · Mengqi Huang · Yihao Meng · Zhendong Mao

[ Exhibit Hall I ]

Abstract
Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in …
Poster
Minsoo Kim · Min-Cheol Sagong · Gi Pyo Nam · Junghyun Cho · Ig-Jae Kim

[ Exhibit Hall I ]

Abstract
Deep learning-based face recognition continues to face challenges due to its reliance on huge datasets obtained from web crawling, which can be costly to gather and raise significant real-world privacy concerns. To address this issue, we propose VIGFace, a novel framework capable of generating synthetic facial images. Our idea originates from pre-assigning virtual identities in the feature space. Initially, we train the face recognition model using a real face dataset and create a feature space for both real and virtual identities, where virtual prototypes are orthogonal to other prototypes. Subsequently, we train the diffusion model based on the established feature space, enabling it to generate authentic human face images from real prototypes and synthesize virtual face images from virtual prototypes. Our proposed framework provides two significant benefits. Firstly, it shows clear separability between existing individuals and virtual face images, allowing one to create synthetic images with confidence and without concerns about privacy and portrait rights. Secondly, it ensures improved performance through data augmentation by incorporating real existing images. Extensive experiments demonstrate the superiority of our virtual face dataset and framework, outperforming the previous state-of-the-art on various face recognition benchmarks.
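Virtual prototypes that are orthogonal to the real class prototypes can be constructed by projecting random directions onto the orthogonal complement of the real prototypes' span. The sketch below shows one standard construction of that kind; it is not necessarily VIGFace's exact procedure.

```python
# Sketch: construct virtual identity prototypes orthogonal to the real ones by
# projecting random vectors onto the orthogonal complement of the real prototypes.
import torch

def make_virtual_prototypes(real_protos, num_virtual):
    # real_protos: (N_real, D) unit-norm class prototypes, N_real + num_virtual <= D.
    D = real_protos.size(1)
    Q, _ = torch.linalg.qr(real_protos.t())          # orthonormal basis of the real span
    rand = torch.randn(D, num_virtual)
    rand = rand - Q @ (Q.t() @ rand)                 # remove components in the real span
    Qv, _ = torch.linalg.qr(rand)                    # orthonormal virtual prototypes
    return Qv.t()                                    # (num_virtual, D)

real = torch.nn.functional.normalize(torch.randn(100, 512), dim=1)
virtual = make_virtual_prototypes(real, 50)
print((virtual @ real.t()).abs().max())              # ~0: orthogonal to real identities
```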
Poster
Zekun Qian · Ruize Han · Zhixiang Wang · Junhui Hou · Wei Feng

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Multi-Object Tracking (OVMOT) aims to detect and track diverse object categories in videos, including both seen (base) and unseen (novel) categories. Current methods rely on appearance features from generated image pairs or utilize the discontinuous annotations of the video dataset (TAO) for training, primarily due to the lack of available continuous annotated video datasets for OVMOT. This limitation affects their effectiveness, since continuous target trajectories are necessary for robust tracker learning. In this work, we propose the C-TAO dataset, which provides a continuous version of TAO, thereby constructing the first continuous annotated training dataset for OVMOT. This addresses the previous limitations in training data availability. Additionally, we introduce COVTrack, a unified framework that effectively integrates motion and semantic features with appearance features, in which the multi-cue feature aggregation strategy dynamically aggregates and balances these features, based on the confidence estimation from both intra-frame and inter-frame contexts. Our proposed framework significantly improves OVMOT performance, establishing COVTrack as a state-of-the-art solution on OVMOT benchmarks.
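Multi-cue aggregation of this kind usually amounts to fusing per-cue similarity matrices with confidence-dependent weights before track-detection assignment. The sketch below uses given scalar confidences for simplicity; COVTrack estimates them from intra- and inter-frame context.

```python
# Sketch: fuse appearance / motion / semantic similarity matrices with
# confidence-dependent weights before track-detection assignment.
# (Weights are given scalars here; the paper estimates them from context.)
import torch

def aggregate_cues(sims, confidences):
    # sims: dict of (num_tracks, num_dets) similarity matrices, one per cue.
    # confidences: dict of per-cue confidence scalars in [0, 1].
    w = torch.tensor([confidences[k] for k in sims])
    w = w / w.sum().clamp(min=1e-6)                  # normalize cue weights
    return sum(wi * sims[k] for wi, k in zip(w, sims))

sims = {"appearance": torch.rand(5, 7), "motion": torch.rand(5, 7), "semantic": torch.rand(5, 7)}
fused = aggregate_cues(sims, {"appearance": 0.6, "motion": 0.9, "semantic": 0.4})
print(fused.shape)                                   # torch.Size([5, 7])
```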
Poster
Xiangyang Luo · Ye Zhu · Yunfei Liu · Lijian Lin · Cong Wan · Zijian Cai · Yu Li · Shao-Lun Huang

[ Exhibit Hall I ]

Abstract
Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped identity is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation.
Poster
Yiran Qin · Li Kang · Xiufeng Song · Zhenfei Yin · Xiaohong Liu · Xihui Liu · Ruimao Zhang · LEI BAI

[ Exhibit Hall I ]

Abstract
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
Poster
Lixing Xiao · Shunlin Lu · Huaijin Pi · Ke Fan · Liang Pan · Yueer Zhou · Ziyong Feng · Xiaowei Zhou · Sida Peng · Jingbo Wang

[ Exhibit Hall I ]

Abstract
This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. The code will be released for reproducibility.
Poster
Jiaben Chen · Xin Yan · Yihang Chen · Siyuan Cen · Zixin Wang · Qinwei Ma · Haoyu Zhen · Kaizhi Qian · Lie Lu · Chuang Gan

[ Exhibit Hall I ]

Abstract
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. We encourage readers to watch the supplementary video …
Poster
Wentao Hu · Shunkai Li · Ziqiao Peng · Haoxian Zhang · Fan Shi · Xiaoqiang Liu · Pengfei Wan · Di ZHANG · Hui Tian

[ Exhibit Hall I ]

Abstract
Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.
Poster
Tao Tang · Likui Zhang · Youpeng Wen · Kaidong Zhang · Jia-Wang Bian · xia zhou · Tianyi Yan · Kun Zhan · Peng Jia · Hefeng Wu · Liang Lin · Xiaodan Liang

[ Exhibit Hall I ]

Abstract
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating satisfactory simulation performance.
Poster
Xiaomeng Chu · Jiajun Deng · Guoliang You · Wei Liu · Xingchen Li · Jianmin Ji · Yanyong Zhang

[ Exhibit Hall I ]

Abstract
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.
Poster
Ruofei WANG · Peiqi Duan · Boxin Shi · Renjie Wan

[ Exhibit Hall I ]

Abstract
With more event datasets being released online, safeguarding the event dataset against unauthorized usage has become a serious concern for data owners. Unlearnable Examples are proposed to prevent the unauthorized exploitation of image datasets. However, it's unclear how to create unlearnable asynchronous event streams to prevent event misuse. In this work, we propose the first unlearnable event stream generation method to prevent unauthorized training from event datasets. A new form of asynchronous event error-minimizing noise is proposed to perturb event streams, tricking the unauthorized model into learning embedded noise instead of realistic features. To be compatible with sparse event data, a projection strategy is presented to sparsify the noise, yielding our unlearnable event streams (UEvs). Extensive experiments demonstrate that our method effectively protects event data from unauthorized exploitation, while preserving their utility for legitimate use. We hope our UEvs contribute to the advancement of secure and trustworthy event dataset sharing.
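Sparsifying the perturbation so it only touches locations where events actually fire can be expressed as masking a dense noise tensor with the event occupancy. The sketch below assumes a voxel-grid event representation and a fixed noise budget; the paper's projection strategy over raw asynchronous streams may differ.

```python
# Sketch: sparsify a dense error-minimizing perturbation so it only modifies
# voxels where events actually occur (assumed voxel-grid event representation).
import torch

def sparsify_noise(event_voxels, dense_noise, eps=8.0 / 255.0):
    # event_voxels, dense_noise: (B, T_bins, H, W)
    occupancy = (event_voxels.abs() > 0).float()      # 1 where any event fired
    noise = dense_noise.clamp(-eps, eps) * occupancy  # bounded, event-aligned noise
    return event_voxels + noise

voxels = torch.randint(-2, 3, (1, 5, 64, 64)).float()
protected = sparsify_noise(voxels, torch.randn(1, 5, 64, 64) * 0.01)
```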
Poster
Jaeha Kim · Junghun Oh · Kyoung Mu Lee

[ Exhibit Hall I ]

Abstract
Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. The goal of TDIR is to improve both visual quality and task performance. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This leads us to leverage the diffusion prior, one of the most powerful image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with state-of-the-art TDIR methods. To address this, we propose EDTR, the first TDIR method that incorporates diffusion prior in ways that harness its strength to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pre-restored LQ images with mild noise added. Moreover, we suggest one-step denoising to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes diffusion prior to restore task-relevant details, significantly enhancing task performance and visual quality across diverse tasks with complex degradations.
Poster
Yandan Wang · Chenqi Guo · Yinglong Ma · Jiangyan Chen · Yuan Gao · Weiming Dong

[ Exhibit Hall I ]

Abstract
Skeleton-based action recognition faces class imbalance and insufficient labeling problems in real-world applications. Existing methods typically address these issues separately, lacking a unified framework that can effectively handle both issues simultaneously while considering their inherent relationships. Our theoretical analysis reveals two fundamental connections between these problems. First, class imbalance systematically shifts the eigenvalue spectrum of normalized affinity matrices, compromising both convergence and accuracy of label propagation. Second, boundary samples are critical for model training under imbalanced conditions but are often mistakenly excluded by conventional reliability metrics, which focus on relative class differences rather than holistic connectivity patterns. Built upon these theoretical findings, we propose SpeLER ($\textbf{Spe}$ctral-balanced $\textbf{L}$abel Propagation with $\textbf{E}$nergy-based Tightened $\textbf{R}$eliability), which introduces a spectral balancing technique that explicitly counteracts spectral shifts by incorporating class distribution information. Meanwhile, a propagation energy-based tightened reliability measure is proposed to better preserve crucial boundary samples by evaluating holistic connectivity patterns. Extensive experiments on six public datasets demonstrate that SpeLER consistently outperforms state-of-the-art methods, validating both our theoretical findings and practical effectiveness.
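The spectral-balancing idea can be pictured against vanilla label propagation: normalize the affinity matrix, then iterate Z <- alpha*S*Z + (1-alpha)*Y, with the seed labels reweighted by class frequency to counteract imbalance. The sketch below is a hedged illustration of that baseline plus a simple frequency reweighting; SpeLER's actual spectral correction and energy-based reliability measure differ in detail.

```python
# Sketch: label propagation with a class-frequency reweighting of the seed labels,
# as one simple way to counteract imbalance-induced spectral shift.
import torch

def balanced_label_propagation(affinity, labels, labeled_mask, alpha=0.9, iters=50):
    # affinity: (N, N) symmetric non-negative; labels: (N,) int; labeled_mask: (N,) bool.
    N, C = affinity.size(0), int(labels.max()) + 1
    d = affinity.sum(1).clamp(min=1e-8)
    S = affinity / torch.sqrt(d[:, None] * d[None, :])     # symmetric normalization
    Y = torch.zeros(N, C)
    Y[labeled_mask, labels[labeled_mask]] = 1.0
    class_freq = Y.sum(0).clamp(min=1.0)
    Y = Y / class_freq                                      # down-weight majority classes
    Z = Y.clone()
    for _ in range(iters):
        Z = alpha * S @ Z + (1 - alpha) * Y
    return Z.argmax(1)                                      # propagated pseudo-labels

A = torch.rand(200, 200); A = (A + A.t()) / 2
lbl = torch.randint(0, 5, (200,)); mask = torch.rand(200) < 0.1
pseudo = balanced_label_propagation(A, lbl, mask)
```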
Poster
Kaiyang Ji · Ye Shi · Zichen Jin · Kangyi Chen · Lan Xu · Yuexin Ma · Jingyi Yu · Jingya Wang

[ Exhibit Hall I ]

Abstract
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
Poster
Xilin He · Cheng Luo · Xiaole Xian · Bing Li · Siyang Song · Muhammad Haris Khan · Weicheng Xie · Linlin Shen · Zongyuan Ge · Bernard Ghanem · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Results validate the efficacy of our approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the …
Poster
Rolandos Alexandros Potamias · Stathis Galanakis · Jiankang Deng · Athanasios Papaioannou · Stefanos Zafeiriou

[ Exhibit Hall I ]

Abstract
Over recent years, 3D morphable models (3DMMs) have emerged as a state-of-the-art methodology for modeling and generating expressive 3D avatars. However, given their reliance on a strict topology, along with their linear nature, they struggle to represent complex full-head shapes. Following the advent of deep implicit functions (DIFs), we propose imHead, a novel implicit 3DMM that not only models expressive 3D head avatars but also facilitates localized editing of the facial features. Previous methods directly divided the latent space into local components accompanied by an identity encoding to capture the global shape variations, leading to expensive latent sizes. In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. To train imHead, we curate a large-scale dataset of over 4,500 identities, taking a step towards large-scale 3D head modeling. Through a series of experiments, we demonstrate the expressive power of the proposed model in representing diverse identities and expressions, outperforming previous approaches. Additionally, the proposed approach provides an interpretable solution for 3D face manipulation, allowing the user to make localized edits. Models and data will be made publicly available for research purposes.
Poster
Li Hu · wang yuan · Zhen Shen · Xin Gao · Dechao Meng · Li'an Zhuo · Peng Zhang · Bang Zhang · Liefeng Bo

[ Exhibit Hall I ]

Abstract
Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region with the exclusion of characters and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
Poster
Fei Xie · Zhongdao Wang · Weijia Zhang · Chao Ma

[ Exhibit Hall I ]

Abstract
Mamba, an architecture with RNN-like sequence modeling of state space model (SSM), has demonstrated promising capabilities in long-range modeling with high efficiency. However, Mamba models struggle with structured 2D visual data using sequential computing, thereby lagging behind their attention-based counterparts. In this paper, we propose a Parallel Vision Mamba (PVMamba), a novel SSM architecture tailored for visual data. PVMamba encompasses two key designs: 1) Based on the sparsity and adjacency of visual signals, we parallelize the sequential computing through three core steps, termed Dynamic State Aggregation (DSA), i.e., parallelization, spatial alignment, and vectorized aggregation. DSA generates the hidden state in SSM by a feasible spatial aggregation, thereby overcoming the inherent sequential constraints. 2) Along with maintaining linear computational complexity, we apply a dynamic operator to learn the spatial samplings for each hidden state. To further boost the local modeling capability, we restrict the dynamic operator to the neighboring pixels in shallow layers. We also devise a layer multiplexing technique to stabilize the training and reduce the learning redundancy. PVMamba is a versatile backbone network with dynamic operators for various vision tasks, such as image classification and dense prediction. Extensive experiments show that PVMamba achieves state-of-the-art performance on a range of …
Poster
YuNing Gong · Jiaming Chen · Xiaohua Ren · Yuanjun Liao · Yanci Zhang

[ Exhibit Hall I ]

Abstract
Contemporary video stylization approaches struggle to achieve artistic stylization while preserving temporal consistency. While generator-based methods produce visually striking stylized results, they suffer from flickering artifacts in dynamic motion scenarios and require prohibitive computational resources. Conversely, non-generative techniques frequently show either temporal inconsistency or inadequate style preservation. We address these limitations by adapting the physics-inspired transport principles from the Transport-based Neural Style Transfer (TNST) framework (originally developed for volumetric fluid stylization) to enforce inter-frame consistency in video stylization. Our framework employs two complementary transformation fields for artistic stylization: a geometric stylization velocity field governing deformation and an orthogonality-regularized color transfer field managing color adaptations. We further strengthen temporal consistency through two key enhancements to our field architecture: a momentum-preserving strategy mitigating vibration artifacts, and an occlusion-aware temporal lookup strategy addressing motion trailing artifacts. Extensive experiments demonstrate FlowStyler's superior performance across dual dimensions: compared to generator-based approaches, we achieve 4$\times$ lower short-term warping errors while maintaining comparable style fidelity; against non-generative methods, FlowStyler attains 22% higher style fidelity with slightly improved temporal stability.
Poster
Chaonan Ji · Jinwei Qi · Peng Zhang · Bang Zhang · Liefeng Bo

[ Exhibit Hall I ]

Abstract
In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows head expressions and movements to be tweaked during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement, neglecting holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head’s expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combining a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing, while a scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method …
Poster
Ruiyang Ha · Songyi Jiang · Bin Li · Bikang Pan · Yihang Zhu · Junjie Zhang · Xiatian Zhu · Shaogang Gong · Jingya Wang

[ Exhibit Hall I ]

Abstract
Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Additionally, our dataset will be made publicly available to support further advancements.
Poster
Chengjun Yu · Wei Zhai · Yuhang Yang · Yang Cao · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available.
Poster
Giacomo D'Amicantonio · Snehashis Majhi · Quan Kong · Lorenzo Garattoni · Gianpiero Francesca · Egor Bondarev · Francois Bremond

[ Exhibit Hall I ]

Abstract
Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a …
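As a rough illustration of temporal Gaussian weighting in a weakly supervised setting, the sketch below builds per-frame weights from a mixture of 1D Gaussians centred on the highest-scoring frames and uses them for MIL-style pooling. The peak selection, bandwidth, and pooling are assumptions for illustration only, not the GS-MoE loss from the paper.

```python
import numpy as np

def temporal_gaussian_weights(scores: np.ndarray, sigma: float = 8.0, top_k: int = 3) -> np.ndarray:
    """Build per-frame weights as a mixture of temporal Gaussians centred on the
    top-k anomaly-score peaks (a hypothetical stand-in for a splatting-style prior)."""
    t = np.arange(len(scores))
    centers = np.argsort(scores)[-top_k:]                      # frames most likely anomalous
    bumps = [np.exp(-0.5 * ((t - c) / sigma) ** 2) for c in centers]
    weights = np.maximum.reduce(bumps)                         # keep the strongest bump per frame
    return weights / (weights.max() + 1e-8)

# Example: emphasise frames near score peaks when forming a video-level (MIL-style) score.
scores = np.random.rand(300)                                   # per-frame anomaly scores
w = temporal_gaussian_weights(scores)
video_level_score = float((w * scores).sum() / w.sum())
```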
Poster
Stefan A. Baumann · Nick Stracke · Timy Phan · Björn Ommer

[ Exhibit Hall I ]

Abstract
Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed "pokes". Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned on strongly out-of-distribution data, such as synthetic datasets, yielding significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, which further demonstrates the versatility of our FPT.
Poster
Tewodros W. Ayalew · Xiao Zhang · Kevin Y Wu · Tianchong Jiang · Michael Maire · Matthew Walter

[ Exhibit Hall I ]

Abstract
We present PROGRESSOR, a novel framework that learns a task-agnostic reward function from videos, enabling policy training through goal-conditioned reinforcement learning (RL) without manual supervision. Underlying this reward is an estimate of the distribution over task progress as a function of the current, initial, and goal observations, learned in a self-supervised fashion. Crucially, PROGRESSOR refines rewards adversarially during online RL training by pushing back high-variance predictions to mitigate the distribution shift inherent in non-expert observations. Utilizing this progress prediction as a dense reward together with the adversarial push-back, we show that PROGRESSOR enables robots to learn complex behaviors without any external supervision. Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data to generalize to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual rewards for robotic learning. Our findings highlight the potential of PROGRESSOR for scalable robotic applications where direct action labels and task-specific rewards are not readily available.
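A minimal sketch of the general recipe the abstract describes: a small progress regressor over (current, initial, goal) embeddings, with a dense reward defined as the change in predicted progress between steps. The architecture, embedding size, and reward shaping are my assumptions, not PROGRESSOR's exact design.

```python
import torch
import torch.nn as nn

class ProgressEstimator(nn.Module):
    """Toy progress head: maps (current, initial, goal) embeddings to a value in [0, 1]."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, cur, init, goal):
        return self.mlp(torch.cat([cur, init, goal], dim=-1)).squeeze(-1)

# Dense reward as the change in predicted progress between consecutive steps
# (a common shaping choice; not necessarily the paper's exact formulation).
est = ProgressEstimator()
cur_t, cur_t1 = torch.randn(1, 512), torch.randn(1, 512)
init, goal = torch.randn(1, 512), torch.randn(1, 512)
reward = est(cur_t1, init, goal) - est(cur_t, init, goal)
```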
Poster
Chirui CHANG · Jiahui Liu · Zhengzhe Liu · Xiaoyang Lyu · Yi-Hua Huang · Xin Tao · Pengfei Wan · Di ZHANG · Xiaojuan Qi

[ Exhibit Hall I ]

Abstract
Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos’ ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Sora, MiniMax, and Kling) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies.
Poster
Chaitanya Patel · Hiroki Nakamura · Yuta Kyuragi · Kazuki Kozuka · Juan Carlos Niebles · Ehsan Adeli

[ Exhibit Hall I ]

Abstract
Egocentric human motion generation and forecasting with scene context are crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where a limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene representations. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness …
Poster
Hyeonwoo Kim · Sangwon Baik · Hanbyul Joo

[ Exhibit Hall I ]

Abstract
Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with a Low-Rank Adaptation (LoRA) module that fine-tunes a pre-trained MDM to learn HOI motion concepts from limited HOI motion samples, and (2) a motion diffusion model for 4D object poses conditioned on the generated human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create …
Poster
Rongjia Zheng · Qing Zhang · Chengjiang Long · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods. Our code and trained model will be made publicly available.
Poster
Longxin Kou · Fei Ni · Jianye HAO · Han Peilong · Jinyi Liu · Haiqin Cui · Rui Liu · YAN ZHENG

[ Exhibit Hall I ]

Abstract
Recent advances in robotics have produced numerous valuable large-scale demonstration datasets, yet their potential remains underutilized due to annotation limitations. Current datasets often suffer from sparse temporal annotations and inconsistent labeling granularity, particularly for complex long-horizon demonstrations. Traditional manual annotation methods are expensive and scale poorly, while existing automated methods struggle with temporal coherence and semantic richness across extended demonstrations. To address this, we propose RoboAnnotatorX, a reliable annotation tool that enhances a multimodal large language model to generate high-quality, context-rich annotations for complex long-horizon demonstrations. Specifically, we introduce a multi-scale, token-efficient encoder that maintains computational efficiency while simultaneously capturing fine-grained visual details and preserving temporal information by jointly integrating scene-level anchoring, clip-level temporal dynamics, and video-level global modeling. We further construct a comprehensive dataset, RoboX-VQA, that synthesizes diverse QA pairs from both real-world and simulated data, bridging the significant domain gap in robotics demonstrations. Moreover, we leverage a curriculum-inspired three-stage training scheme to progressively develop capabilities from basic visual perception to sophisticated temporal reasoning. Extensive experiments demonstrate that RoboAnnotatorX significantly outperforms existing approaches in annotation quality and exhibits strong generalization across diverse robotic environments, helping unlock the full potential of existing robotic datasets.
Poster
Jaehwan Jeong · Sumin In · Sieun Kim · Shin yi · Jongheon Jeong · Sang Yoon · Jaewook Chung · Sangpil Kim

[ Exhibit Hall I ]

Abstract
The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity is disregarded. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named **FaceShield**, which introduces novel defense strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates defenses on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting transferability to GANs and showcasing greater imperceptibility of noise …
Poster
An Lun Liu · Yu-Wei Chao · Yi-Ting Chen

[ Exhibit Hall I ]

Abstract
In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance.
Poster
shanlin sun · Yifan Wang · Hanwen Zhang · Yifeng Xiong · Qin Ren · Ruogu Fang · Xiaohui Xie · Chenyu You

[ Exhibit Hall I ]

Abstract
While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
Poster
Jun Xiang · Yudong Guo · Leipeng Hu · Boyang Guo · Yancheng Yuan · Juyong Zhang

[ Exhibit Hall I ]

Abstract
Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-frames, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
Poster
Kefan Chen · Sergiu Oprea · Justin Theiss · Sreyas Mohan · Srinath Sridhar · Aayush Prakash

[ Exhibit Hall I ]

Abstract
With the rising interest from the community in digital avatars, coupled with the importance of expressions and gestures in communication, modeling natural avatar behavior remains an important challenge across many industries such as teleconferencing, gaming, and AR/VR. Human hands are the primary tool for interacting with the environment and essential for realistic human behavior modeling, yet existing 3D hand and head avatar models often overlook the crucial aspect of hand-body interactions, such as those between hand and face. We present InteracttAvatar, the first model to faithfully capture the photorealistic appearance of dynamic hands and non-rigid hand-face interactions. Our novel Dynamic Gaussian Hand model, combining a template model and 3D Gaussian Splatting with a dynamic refinement module, captures pose-dependent changes, e.g., the fine wrinkles and complex shadows that occur during articulation. Importantly, our hand-face interaction module models the subtle geometry and appearance dynamics that underlie common gestures. Through experiments on novel view synthesis, self-reenactment and cross-identity reenactment, we demonstrate that InteracttAvatar can reconstruct hands and hand-face interactions from monocular or multiview videos with high-fidelity details and can be animated with novel poses.
Poster
Zefeng Qian · Xincheng Yao · Yifei Huang · Chong-Yang Zhang · Jiangyong Ying · Hong Sun

[ Exhibit Hall I ]

Abstract
Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, object interactions, and motion dynamics that occur during different phases of an action are critical inherent knowledge of actions that cannot be fully exploited by relying solely on the text within action labels. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework for FSAR that goes beyond label semantics by modeling actions at a finer granularity. LGA anatomizes both the textual and visual modalities, effectively exploring rich spatiotemporal cues across different temporal phases of actions. For text, we prompt an off-the-shelf Large Language Model to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, we design a Visual Anatomy Module to segment actions into atomic video phases, capturing the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure …
Poster
Huu Phu Do · Yu-Wei Chen · Yi-Cheng Liao · Chi-Wei Hsiao · Han-Yang Wang · Wei-Chen Chiu · Ching-Chun Huang

[ Exhibit Hall I ]

Abstract
Blind Face Restoration aims to recover high-fidelity, detail-rich facial images from unknown degraded inputs, presenting significant challenges in preserving both identity and detail. Pre-trained diffusion models have been increasingly used as image priors to generate fine details. Still, existing methods often use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation. This limitation and potentially imperfect degradation kernel estimation frequently lead to under- or over-diffusion, resulting in an imbalance between fidelity and quality. We propose DynFaceRestore, a novel blind face restoration approach that learns to map any blindly degraded input to Gaussian blurry images. By leveraging these blurry images and their respective Gaussian kernels, we dynamically select the starting timesteps for each blurry image and apply closed-form guidance during the diffusion sampling process to maintain fidelity. Additionally, we introduce a dynamic guidance scaling adjuster that modulates the guidance strength across local regions, enhancing detail generation in complex areas while preserving structural fidelity in contours. This strategy effectively balances the trade-off between fidelity and quality. DynFaceRestore achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating robustness and effectiveness in blind face restoration.
Poster
Xudong Li · Zihao Huang · Yan Zhang · Yunhang Shen · Ke Li · Xiawu Zheng · Liujuan Cao · Rongrong Ji

[ Exhibit Hall I ]

Abstract
Image Quality Assessment (IQA) remains an unresolved challenge in the field of computer vision due to complex distortion conditions, diverse image content, and limited data availability. Existing Blind IQA (BIQA) methods heavily rely on extensive human annotations to train models, which is both labor-intensive and costly given the demanding nature of creating IQA datasets. To mitigate the dependence on labeled samples, this paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA). This framework aims to rapidly adapt the powerful visual-language pre-trained model CLIP to downstream IQA tasks, significantly improving accuracy in scenarios with limited data. Specifically, GRMP-IQA comprises two key modules: a Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization. The Meta-Prompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. The Quality-Aware Gradient Regularization, in turn, adjusts the update gradients during fine-tuning, focusing the model's attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on five standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods under the limited-data setting, i.e., achieving SRCC values of 0.836 (vs. 0.760 on LIVEC) and …
Poster
Fangyikang Wang · Hubery Yin · Lei Qian · Yinan Li · SHAOBIN ZHUANG · Huminhao Zhu · Yilin Zhang · Yanlong Tang · Chao Zhang · Hanbin Zhao · Hui Qian · Chen Li

[ Exhibit Hall I ]

Abstract
Emerging diffusion models (DMs) have demonstrated a remarkable capability for generating images by learning the noised score function of the data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin dynamics is common practice in Markov chain Monte Carlo (MCMC), naive attempts to utilize Hessian geometry in high-dimensional DMs incur quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) a low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; and (2) a damping mechanism to stabilize the approximated Hessian. This approximated Hessian geometry enables the diffusion sampling to take more accurate steps and improves image generation quality. We further conduct theoretical analysis to substantiate the approximation error bound of the low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational …
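The combination of a low-rank Hessian approximation with Levenberg-Marquardt-style damping has a standard scalable form via the Woodbury identity. The sketch below shows that generic computation under the assumption of a rank-r factor U with H ≈ U U^T; it illustrates the "low-rank plus damping" ingredient only, not the paper's specific Hessian construction or sampler.

```python
import numpy as np

def damped_lowrank_solve(U: np.ndarray, g: np.ndarray, lam: float) -> np.ndarray:
    """Compute (U U^T + lam * I)^{-1} g via the Woodbury identity, avoiding any D x D matrix.

    U   : (D, r) low-rank factor approximating the Hessian as H ~ U U^T
    g   : (D,)   gradient / score direction to precondition
    lam : damping (Levenberg-Marquardt style) keeping the system well conditioned
    """
    r = U.shape[1]
    small = lam * np.eye(r) + U.T @ U            # only an (r, r) system is solved
    return (g - U @ np.linalg.solve(small, U.T @ g)) / lam

# Example: precondition a score/gradient step with the damped low-rank curvature.
D, r = 4096, 16
U = np.random.randn(D, r) / np.sqrt(D)
g = np.random.randn(D)
step = damped_lowrank_solve(U, g, lam=0.1)
```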
Poster
Taesung Kwon · Jong Ye

[ Exhibit Hall I ]

Abstract
In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present pseudo-batch inversion, an initialization technique that incorporates informative latents from the measurement. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280×720) in under 6 seconds per frame on a single NVIDIA 4090 GPU. Project page: https://vision-xl.github.io/.
Poster
Kaname Yokoyama · Chihiro Nakatani · Norimichi Ukita

[ Exhibit Hall I ]

Abstract
This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. In our method, such local and global appearance features are extracted in each frame using a Vision-Language Model (VLM) augmented for group detection. For further improvement, the group structure should be consistent over time. While previous methods rely on the assumption that groups do not change within a video, our method detects dynamically changing groups by global optimization over a graph built from all frames' groupness probabilities, estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://anonymous.4open.science/r/ICCV2025_DVT-D1A5
Poster
Hanqing Liu · Shouwei Ruan · Yao Huang · Shiji Zhao · Xingxing Wei

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose $\textbf{I}$llumination $\textbf{T}$ransformation $\textbf{A}$ttack ($\textbf{ITA}$), the first framework to systematically assess VLMs' robustness against illumination changes. However, two key challenges remain: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions, and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we innovatively decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations than previous methods could capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we can precisely render such light interactions in the original scenes, finally achieving fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints to maintain visual consistency with the original images and diversity constraints to avoid light source convergence. Extensive experiments demonstrate that our ITA could significantly …
Poster
Sébastien Herbreteau · Michael Unser

[ Exhibit Hall I ]

Abstract
Supervised deep learning has become the method of choice for image denoising. It involves training neural networks on large datasets composed of pairs of noisy and clean images. However, the necessity of training data specific to the targeted application constrains the widespread use of denoising networks. Recently, several approaches have been developed to overcome this difficulty, either by artificially generating realistic clean/noisy image pairs or by training exclusively on noisy images. In this paper, we show that, contrary to popular belief, denoising networks specialized in the removal of Gaussian noise can be efficiently leveraged for real-world image denoising, even without additional training. For this to happen, an appropriate variance-stabilizing transform (VST) has to be applied beforehand. We propose an algorithm termed Noise2VST for learning such a model-free VST. Our approach requires only the input noisy image and an off-the-shelf Gaussian denoiser. We demonstrate through extensive experiments the efficiency and superiority of Noise2VST in comparison to existing methods trained in the absence of specific clean/noisy pairs.
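For context, the classical fixed-VST pipeline that Noise2VST generalizes can be sketched with the Anscombe transform for Poisson noise: stabilize the variance, run an off-the-shelf Gaussian denoiser, then invert. The denoiser below is a trivial stand-in (a Gaussian filter); Noise2VST instead learns a model-free transform from the noisy image itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def anscombe(x):
    """Classical VST: approximately maps Poisson noise to unit-variance Gaussian noise."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    """Simple algebraic inverse (the asymptotically unbiased inverse adds small corrections)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

def denoise_with_vst(noisy, gaussian_denoiser):
    """Stabilize variance, run an off-the-shelf Gaussian denoiser (sigma ~ 1), then invert."""
    stabilized = anscombe(noisy)
    denoised = gaussian_denoiser(stabilized, sigma=1.0)   # placeholder for any Gaussian denoiser
    return inverse_anscombe(denoised)

# Example with a trivial stand-in denoiser (a real pipeline would use a pretrained network).
clean = np.clip(np.random.rand(64, 64) * 30, 0, None)
noisy = np.random.poisson(clean).astype(np.float64)
restored = denoise_with_vst(noisy, lambda img, sigma: gaussian_filter(img, 1.0))
```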
Poster
Xuhong Huang · Shiqi Liu · Kai Zhang · Ying Tai · Jian Yang · Hui Zeng · Lei Zhang

[ Exhibit Hall I ]

Abstract
Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution, a.k.a. deconvolution, does not truly invert convolution due to their inherent differences in formulation. To date, there is no reverse convolution operator that has been developed as a basic component in deep neural networks. In this paper, we propose a novel depthwise reverse convolution operator as a first-step exploration to effectively reverse the depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this reverse convolution operator, we integrate it with layer normalization, 1$\times$1 convolution, and GELU activation to form a reverse convolution block, similar to a Transformer block. The proposed reverse convolution block can easily replace its convolution and transposed convolution counterparts in existing architectures, leading to the development of ConverseNet. By incorporating it into classical models like DnCNN, SRResNet and USRNet, we train ConverseNet to solve three typical image restoration tasks including Gaussian denoising, super-resolution and deblurring. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as both a fundamental building block and a novel deconvolution operator for inverse problems. We …
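Under circular boundary conditions, the regularized least-squares problem mentioned above has a classical per-channel closed form in the Fourier domain. The sketch below illustrates that baseline solve (kernel sizes, padding behavior, and the regularization weight are assumptions); it is not the learned reverse convolution operator or the block design proposed in the paper.

```python
import numpy as np

def reverse_depthwise_conv(y: np.ndarray, kernels: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Per-channel regularized least-squares 'reverse convolution' (circular boundaries).

    y       : (C, H, W) feature map produced by a depthwise convolution
    kernels : (C, kh, kw) one spatial kernel per channel
    Solves   argmin_x ||k (*) x - y||^2 + lam * ||x||^2  channel by channel via FFT.
    """
    C, H, W = y.shape
    out = np.empty_like(y)
    for c in range(C):
        K = np.fft.fft2(kernels[c], s=(H, W))          # kernel spectrum, zero-padded
        Y = np.fft.fft2(y[c])
        X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)    # Wiener-style closed form
        out[c] = np.real(np.fft.ifft2(X))
    return out

# Example usage with random features and 3x3 kernels.
feat = np.random.randn(8, 32, 32)
ker = np.random.randn(8, 3, 3)
recovered = reverse_depthwise_conv(feat, ker)
```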
Poster
Liyuan Deng · Yunpeng Bai · Yongkang Dai · Xiaoshui Huang · Hongping Gan · Dongshuo Huang · Hao jiacheng · Yilei Shi

[ Exhibit Hall I ]

Abstract
Parametric Computer-Aided Design (CAD) is crucial in industrial applications, yet existing approaches often struggle to generate long-sequence parametric commands due to the geometric and topological constraints of complex CAD models. To address this challenge, we propose MamTiff-CAD, a novel CAD parametric command sequence generation framework that leverages a Transformer-based diffusion model over multi-scale latent representations. Specifically, we design a novel autoencoder that integrates Mamba+ and Transformer blocks to map parameterized CAD sequences into latent representations. The Mamba+ block incorporates a forget gate mechanism to effectively capture long-range dependencies, and a non-autoregressive Transformer decoder reconstructs sequences from the latent representations. A diffusion model based on a multi-scale Transformer is then trained on these latent embeddings to learn the distribution of long command sequences. In addition, we construct a dataset of long parametric sequences, with up to 256 commands per CAD model. Experiments demonstrate that MamTiff-CAD achieves state-of-the-art performance on both reconstruction and generation tasks, confirming its effectiveness for long-sequence (60-256 commands) CAD model generation.
Poster
Md Ashiqur Rahman · Chiao-An Yang · Michael N Cheng · Lim Hao · Jeremiah Jiang · Teck-Yian Lim · Raymond A. Yeh

[ Exhibit Hall I ]

Abstract
Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT.
Poster
Chenzhong Gao · Wei Li · Desheng Weng

[ Exhibit Hall I ]

Abstract
We explore cross-arbitrary-modal image invariant feature extraction and matching with a purely handcrafted full-chain algorithm, Homomorphism of Organized Major Orientation (HOMO). Instead of using deep models to conduct data-driven black-box learning, we introduce a Major Orientation Map (MOM) that effectively combats image modal differences. Considering rotation, scale, and texture diversities in cross-modal images, HOMO incorporates a novel, universally designed Generalized-Polar descriptor (GPolar) and a Multi-scale Strategy (MsS) to gain well-rounded capability. HOMO achieves the best overall feature-matching performance on several broadly cross-modal datasets, compared against a challenging set of state-of-the-art methods including 7 traditional algorithms and 10 deep network models. A dataset named General Cross-modal Zone (GCZ) is also proposed, which shows practical value.
Poster
Xirui Hu · Jiahao Wang · Hao chen · Weizhan Zhang · Benqi Wang · yikun Li · Haishun Nan

[ Exhibit Hall I ]

Abstract
Recent advancements in text-to-image generation have spurred interest in personalized human image generation, which aims to create novel images featuring specific human identities indicated by reference images. Although existing methods achieve high-fidelity identity preservation, they often struggle with limited multi-ID usability and inadequate facial editability. We present DynamicID, a tuning-free framework supported by a dual-stage training paradigm that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the original model when injecting ID features and to achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which leverages contrastive learning to effectively disentangle and re-entangle facial motion and identity features, thereby enabling flexible facial editing. Additionally, we have developed the curated VariFace-10k facial dataset, comprising 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability.
Poster
Yiwen Zhao · Yang Wang · Liting Wen · Hengyuan Zhang · Xingqun Qi

[ Exhibit Hall I ]

Abstract
Generating harmonic and diverse human motions from music signals, especially for multi-person group dance, is a practical yet challenging task in virtual avatar creation. Existing methods merely model group dance with a fixed number of dancers, lacking the flexibility to generate group movements for an arbitrary number of individuals. To fulfill this goal, we propose a novel unified framework capable of synthesizing an arbitrary number of dancers harmonically aligned with given music, namely $\textbf{\textit{FreeDance}}$. Considering the plausibility of arbitrary dancer generation while preserving the diverse dynamics of multiple individuals, we build the framework upon collaborative masked token modeling in 2D discrete space. In particular, we devise a $\textbf{\textit{Cross-modality Residual Alignment Module (CRAM)}}$ to diversify the movement of each individual and intensify its alignment with the music. CRAM captures the spatial motion deformation of each dancer using residual learning and integrates it with rhythmic representation into a joint embedding. We leverage this joint embedding to enhance cross-entity alignment while reinforcing the intrinsic connection between motion and music. Moreover, recognizing the requirement of interactive coordination among generated multi-dancer motions, we design a $\textbf{\textit{Temporal Interaction Module (TIM)}}$. Benefiting from masked 2D motion tokens, TIM effectively models the temporal correlation between the current individual and neighboring dancers as interaction guidance to foster stronger inter-dancer dependencies. Extensive experiments …
Poster
Ron Raphaeli · Sean Man · Michael Elad

[ Exhibit Hall I ]

Abstract
Plug-and-play methods for solving inverse problems have continuously improved over the years by incorporating more advanced image priors. Latent diffusion models are among the most powerful priors, making them a natural choice for solving inverse problems. However, existing approaches require multiple applications of an Autoencoder to transition between pixel and latent spaces during restoration, leading to high computational costs and degraded restoration quality. In this work, we introduce a new plug-and-play paradigm that operates entirely in the latent space of diffusion models. By emulating pixel-space degradations directly in the latent space through a short learning phase, we eliminate the need for the Autoencoder during restoration, enabling faster inference and improved restoration fidelity. We validate our method across various image restoration tasks and datasets, achieving significantly higher perceptual quality than previous methods while being $2.6{-}10{\times}$ faster in inference and $1.7{-}7{\times}$ faster when accounting for the learning phase of the latent operator.
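A minimal sketch of the "short learning phase" idea: fit a small latent operator so that applying it to a clean latent matches the latent of the degraded image. The encoder, pixel-space degradation, 4-channel latent shape, and network architecture are placeholders of my choosing, not the paper's actual components.

```python
import torch
import torch.nn as nn

def train_latent_operator(encoder, pixel_degradation, images, steps=1000, lr=1e-4):
    """Fit a small latent operator A_lat so that A_lat(E(x)) ~ E(A(x)).

    `encoder` and `pixel_degradation` stand in for a frozen VAE encoder and a known
    pixel-space degradation (e.g., blur plus downsampling); `images` yields batches.
    """
    latent_op = nn.Sequential(                     # lightweight stand-in, assuming 4-channel latents
        nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
        nn.Conv2d(64, 4, 3, padding=1),
    )
    opt = torch.optim.Adam(latent_op.parameters(), lr=lr)
    for _, x in zip(range(steps), images):
        with torch.no_grad():
            z_clean = encoder(x)                   # latent of the clean image
            z_degraded = encoder(pixel_degradation(x))
        loss = nn.functional.mse_loss(latent_op(z_clean), z_degraded)
        opt.zero_grad(); loss.backward(); opt.step()
    return latent_op
```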
Poster
Li · Nikolaos Tsagkas · Jifei Song · Ruaridh Mon-Williams · Sethu Vijayakumar · Kun Shao · Laura Sevilla-Lara

[ Exhibit Hall I ]

Abstract
Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves 77.1% successful grasping among 179 trials, including evaluations on seen, unseen classes, and cluttered scenes.
Poster
Yufei Cai · Hu Han · Yuxiang Wei · Shiguang Shan · Xilin Chen

[ Exhibit Hall I ]

Abstract
The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer approaches explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization frameworks, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be made publicly available.
Poster
Zeyuan Chen · Hongyi Xu · Guoxian Song · You Xie · Chenxu Zhang · Xin Chen · Chao Wang · Di Chang · Linjie Luo

[ Exhibit Hall I ]

Abstract
We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to …
Poster
Ruowen Zhao · James Jun Liang Chen Ye · Zhengyi Wang · Guangce Liu · Yiwen Chen · Yikai Wang · Jun Zhu

[ Exhibit Hall I ]

Abstract
Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality.
Poster
Yi Qin · Rui Wang · Tao Huang · Tong Xiao · Liping Jing

[ Exhibit Hall I ]

Abstract
While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often exhibit limited transferability due to insufficient exploration of common weaknesses across domains. To address this, we propose a novel method, Vertex-Refining Simplicial Complex Attack (VeSCA), which generates transferable adversarial examples by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data. Notably, VeSCA leverages only the encoder of SAM, which mitigates overfitting, and generates consistently transferable adversarial examples by random simplicial complex sampling. Extensive experiments demonstrate that VeSCA improves performance by 12.7\% compared to state-of-the-art methods across three downstream model categories and five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM’s vulnerabilities.
Poster
Feng Huang · Shuyuan Zheng · Zhaobing Qiu · Huanxian Liu · huanxin Bai · Liqiong Chen

[ Exhibit Hall I ]

Abstract
Infrared small target detection is currently a hot and challenging task in computer vision. Existing methods usually focus on mining visual features of targets, which struggles to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interference, leading to lower detection performance. To address this issue, we introduce a novel approach that leverages semantic text to guide infrared small target detection, called Text-IRSTD. It innovatively expands classical IRSTD to text-guided IRSTD, providing a new research direction. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than the state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released …
Poster
Haifeng Zhong · Fan Tang · Zhuo Chen · Hyung Jin Chang · Yixing Gao

[ Exhibit Hall I ]

Abstract
The challenge of multimodal semantic segmentation lies in establishing semantically consistent and segmentable multimodal fusion features under conditions of significant visual feature discrepancies. Existing methods commonly construct cross-modal self-attention fusion frameworks or introduce additional multimodal fusion loss functions to establish fusion features. However, these approaches often overlook the challenge caused by feature discrepancies between modalities during the fusion process. To achieve precise segmentation, we propose an Attention-Driven Multimodal Discrepancy Alignment Network (AMDANet). AMDANet reallocates weights to reduce the saliency of discrepant features and utilizes low-weight features as cues to mitigate discrepancies between modalities, thereby achieving multimodal feature alignment. Furthermore, to simplify the feature alignment process, a semantic consistency inference mechanism is introduced to reveal the network's inherent bias toward specific modalities, thereby compressing cross-modal feature discrepancies at the foundational level. Extensive experiments on the FMB, MFNet, and PST900 datasets demonstrate that AMDANet achieves mIoU improvements of 3.6%, 3.0%, and 1.6%, respectively, significantly outperforming state-of-the-art methods.
Poster
Vanessa Sklyarova · Egor Zakharov · Malte Prinzler · Giorgio Becherini · Michael Black · Justus Thies

[ Exhibit Hall I ]

Abstract
We present a novel approach for hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground-truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hair and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality, since it requires manual work from skilled artists to model the 3D hairstyles and create nearly photorealistic renderings. To address this, we propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image while preserving the general structure of hairstyles. We exploit this prior to create a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Through qualitative and quantitative comparisons with existing …
Poster
Xinyue Li · Zhangkai Ni · Wenhan Yang

[ Exhibit Hall I ]

Abstract
Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but their black-box design restricts interpretability and consistency. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks—alignment and fusion—optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. Our method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporating spatial correspondence priors across LDR images and naturally bridging the alignment and fusion subproblems through joint constraints. Building on the mathematical foundation, we reimagine traditional iterative optimization through unfolding—transforming the conventional solution process into an end-to-end trainable AFUNet with carefully designed modules that work progressively. Specifically, each iteration of AFUNet incorporates an Alignment-Fusion Module (AFM) that alternates between a Spatial Alignment Module (SAM) for alignment and a Channel Fusion Module (CFM) for adaptive feature fusion, progressively bridging misaligned content and exposure discrepancies. Extensive qualitative and quantitative evaluations demonstrate AFUNet’s superior performance, consistently surpassing state-of-the-art methods. Our codes will be made available.
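To make the unfolding pattern concrete, here is a minimal single-path sketch of an alternating align-then-fuse stage stacked for K iterations. The module internals are generic stand-ins that I chose for illustration; they are not the paper's SAM/CFM designs or its MAP-derived update rules.

```python
import torch
import torch.nn as nn

class AlignFuseStage(nn.Module):
    """One unfolded iteration: align a non-reference feature to the reference, then gate-fuse."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.align = nn.Conv2d(2 * channels, channels, 3, padding=1)   # stand-in for spatial alignment
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())  # stand-in for channel fusion

    def forward(self, ref, other, fused):
        aligned = other + self.align(torch.cat([ref, other], dim=1))   # residual alignment step
        g = self.gate(torch.cat([fused, aligned], dim=1))
        return fused + g * aligned                                     # refined estimate for the next stage

# Stacking K such stages gives an unfolded, end-to-end trainable pipeline.
stages = nn.ModuleList(AlignFuseStage() for _ in range(4))
ref, other = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
fused = ref.clone()
for stage in stages:
    fused = stage(ref, other, fused)
```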
Poster
Sakuya Ota · Qing Yu · Kent Fujiwara · Satoshi Ikehata · Ikuro Sato

[ Exhibit Hall I ]

Abstract
Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into sequential, semantically relevant pairwise interactions, leveraging pretrained two-person interaction diffusion models. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.
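A rough sketch of training-free noise optimization with a physics-based penalty, assuming a frozen two-person diffusion sampler exposed as a differentiable `generator` from noise to joint trajectories; the penetration radius, optimizer, and weights are illustrative assumptions rather than PINO's actual settings.

```python
import torch

def penetration_penalty(joints_a: torch.Tensor, joints_b: torch.Tensor, radius: float = 0.15) -> torch.Tensor:
    """Penalize joints of two characters that come closer than `radius` (metres, assumed)."""
    d = torch.cdist(joints_a, joints_b)            # pairwise distances between all joints
    return torch.clamp(radius - d, min=0.0).pow(2).sum()

def optimize_noise(generator, noise, steps: int = 50, lr: float = 0.05, w_phys: float = 10.0):
    """Gradient-based refinement of diffusion noise with a physics-based penalty.

    `generator` is a placeholder for a frozen two-person interaction sampler that maps
    noise to joint trajectories for characters A and B.
    """
    noise = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        joints_a, joints_b = generator(noise)      # each: (T * J, 3) joint positions
        loss = w_phys * penetration_penalty(joints_a.reshape(-1, 3), joints_b.reshape(-1, 3))
        opt.zero_grad(); loss.backward(); opt.step()
    return noise.detach()

# Toy usage with a stand-in "generator" that simply reshapes the noise into trajectories.
T, J = 60, 22
dummy_gen = lambda n: (n[0].reshape(T * J, 3), n[1].reshape(T * J, 3))
noise0 = torch.randn(2, T * J * 3)
refined = optimize_noise(dummy_gen, noise0, steps=5)
```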
Poster
Yanwen Wang · Yiyu Zhuang · Jiawei Zhang · Li Wang · Yifei Zeng · Xun Cao · Xinxin Zuo · Hao Zhu

[ Exhibit Hall I ]

Abstract
Efficient 3D avatar creation is in significant demand in the metaverse, film/games, AR/VR, etc. In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human representation. Experiments demonstrate our approach's superiority over previous text-to-avatar generative models in both subjective and objective evaluations. The code and data will be publicly released upon publication.
Poster
Jeongeun Park · Sungjoon Choi · Sangdoo Yun

[ Exhibit Hall I ]

Abstract
Recent advancements in large language models (LLMs) have significantly improved their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these interactions. Additionally, a unified and versatile model is needed to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies that primarily focus on uni-directional tasks such as text-to-motion or motion-to-text, VIM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the absence of an appropriate dataset to support this task, we introduce Inter-MT$^2$, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, covering 153K interactive motion samples. Inter-MT$^2$ spans diverse instructional scenarios, including motion editing, question answering, and story generation, leveraging off-the-shelf large language models and motion diffusion models to construct a broad set of interactive motion instructions. We extensively evaluate the versatility of VIM across …
Poster
Jingwen Deng · Zihao Wang · Shaofei Cai · Anji Liu · Yitao Liang

[ Exhibit Hall I ]

Abstract
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce **Skill Boundary Detection** (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in the Minecraft environment, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of two conditioned policies by 63.7\% and 52.1\% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3\% and 20.8\% on long-horizon tasks.
Poster
Elena Bueno-Benito · Mariella Dimiccoli

[ Exhibit Hall I ]

Abstract
Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
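As a rough illustration of the optimal-transport step that produces frame-level pseudo-labels, here is a minimal entropic Sinkhorn solver. It is a generic sketch under assumed shapes (frames by actions) and is not CLOT's actual formulation, which adds temporal structure and segment-level couplings.

```python
import torch

def sinkhorn(cost: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
             eps: float = 0.1, n_iters: int = 100) -> torch.Tensor:
    """Entropy-regularized OT plan between frames (rows) and actions (columns)."""
    K = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)                 # column scaling
        u = a / (K @ v)                     # row scaling
    return u[:, None] * K * v[None, :]      # transport plan

# Toy usage: 50 frames, 4 action labels, uniform marginals.
cost = torch.rand(50, 4)
plan = sinkhorn(cost, torch.full((50,), 1 / 50), torch.full((4,), 1 / 4))
pseudo_labels = plan.argmax(dim=1)          # hard frame-level pseudo-labels
```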
Poster
Hebaixu Wang · Jiayi Ma

[ Exhibit Hall I ]

Abstract
In the field of pan-sharpening, existing deep methods struggle to deepen cross-modal complementarity in intermediate features and lack effective strategies for harnessing the network as a whole toward optimal solutions, exhibiting limited feasibility and interpretability due to their black-box designs. Besides, validating pan-sharpening performance on high-level semantic tasks is intractable owing to the absence of suitable datasets. To tackle these issues, we propose a deep adaptive unfolded network via spatial morphology stripping and spectral filtration for pan-sharpening, which is conceptualized as a linear inverse problem regularized by spatial and spectral priors. Specifically, we incorporate phase-oriented constraints into the spatial prior to facilitate thorough extraction of modal-invariant spatial morphology via intrinsic decomposition, and leverage physics-driven spectral filtration attention mechanisms aligned with the spectral prior to mine the inherent spectral correlation. After transparently unfolding the model into a multi-stage network, an adaptive stage-exiting mechanism is designed to capitalize on fusion diversity by aggregating optimal image patches across candidate stages. To complete the assessment systematically, we construct the first panoptic segmentation dataset as a semantic-level benchmark for validating pan-sharpening performance. Extensive experiments verify the merits of our method against state-of-the-art methods.
Poster
Sanjoy Chowdhury · Subrata Biswas · Sayan Nag · Tushar Nagarajan · Calvin Murdock · Ishwarya Ananthabhotla · Yijun Qian · Vamsi Ithapu · Dinesh Manocha · Ruohan Gao

[ Exhibit Hall I ]

Abstract
Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EGOADAPT, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6×, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.
Poster
CHEN LIANG · Wenguan Wang · Yi Yang

[ Exhibit Hall I ]

Abstract
Building autonomous agents that can replicate human behavior in the realistic 3D world is a key step toward artificial general intelligence. This requires agents to be holistic goal achievers and to naturally adapt to environmental dynamics. In this work, we introduce ACTOR, an agent capable of pursuing high-level, long-horizon, abstract goals in 3D households, guided by internal values similar to those of humans. ACTOR operates in a perceive-plan-act cycle, extending the ungrounded, scene-agnostic LLM controller with deliberate goal decomposition and decision-making through actively searching the behavior space, generating activity choices based on a hierarchical prior, and evaluating these choices using customizable value functions to determine the subsequent steps. Furthermore, we introduce BehaviorHub, a large-scale human behavior simulation dataset for scene-aware, complicated tasks. Given the prohibitive cost of acquiring human-authored 3D human behavior data, we construct BehaviorHub by exploring the commonsense knowledge of LLMs learned from large corpora, and automatically aligning motion resources with 3D scenes for knowledgeable generation. Extensive experiments on our established benchmark demonstrate that the proposed architecture leads to effective behavior planning and simulation. BehaviorHub also proves beneficial for downstream task development. Our code and dataset will be publicly released.
Poster
Byeonghun Lee · Hyunmin Cho · Honggyu Choi · Soo Min Kang · ILJUN AHN · Kyong Hwan Jin

[ Exhibit Hall I ]

Abstract
Most existing diffusion models have primarily utilized reference images for image-to-image translation rather than for super-resolution (SR). In SR-specific tasks, diffusion methods depend only on low-resolution (LR) inputs, limiting their ability to leverage reference information. Prior reference-based diffusion SR methods have demonstrated that incorporating appropriate reference images can significantly enhance reconstruction quality; however, identifying suitable references in real-world scenarios remains a critical challenge. Recently, Retrieval-Augmented Generation (RAG) has emerged as an effective framework that integrates retrieval-based and generation-based information from databases to enhance the accuracy and relevance of responses to a given query. Inspired by RAG, we propose an image-based RAG framework (iRAG) for realistic super-resolution. iRAG employs a trainable hashing function to effectively retrieve either real-world or generated reference images given a query LR input. The retrieved patches are then passed to a restoration module, where they are leveraged to augment the retrieved information and generate high-fidelity super-resolved features. Furthermore, to improve the quality of generated references from pre-trained diffusion models, we adopt a hallucination filtering mechanism, leading to overall performance enhancements. Experimental results demonstrate that our approach not only resolves the practical difficulties of reference selection but also delivers superior performance compared to existing diffusion-based and non-diffusion …
Poster
Xiaolong Xu · Lei Zhang · Jiayi Li · Lituan Wang · Yifan Guan · Yu Yan · Leyi Zhang · Hao Song

[ Exhibit Hall I ]

Abstract
Video semantic segmentation aims to assign a class label to each pixel in every video frame. Existing methods predominantly follow the reference-target interaction paradigm, focusing on extracting local temporal contexts while neglecting the integration of global temporal information. Moreover, complex dynamics and varying lighting conditions introduce inter-frame intra-class discrepancies in feature representations, leading to unstable predictions. In this paper, we propose a novel framework, the Dual-Temporal Exemplar Representation Network (DTERN), which utilizes the strong representational capability of cluster centers, i.e., exemplars, to effectively model both local and global temporal information. DTERN consists of two core modules: 1) the Local Temporal Exemplar Module (LTEM), which constructs local exemplars to capture local temporal contexts, ensuring stable and reliable predictions, and 2) the Global Temporal Exemplar Module (GTEM), which introduces learnable global exemplars to dynamically model global temporal information, thereby improving the effective consistency of segmentation. Furthermore, we observe that the existing Video Consistency (VC) metric fails to evaluate segmentation accuracy and lacks sensitivity to small-object segmentation. To this end, we propose Video Effective Consistency (VEC) to comprehensively evaluate temporal consistency and segmentation effectiveness. Experiments on VSPW and Cityscapes demonstrate that DTERN outperforms state-of-the-art methods. The code is available at https://anonymous.4open.science/r/DTERN/.
Poster
Dat NGUYEN · Marcella Astrid · Anis Kacem · Enjie Ghorbel · Djamila Aouada

[ Exhibit Hall I ]

Abstract
Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained using real and fake image sequences, thereby hindering their generalization capabilities to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hands-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods.
Poster
Ryan Webster · Teddy Furon

[ Exhibit Hall I ]

Abstract
The success of multi-modal foundational models can be partly attributed to their diverse, billion-scale training data. By nature, web data contains human faces and descriptions of individuals. Thus, these models pose potentially widespread privacy issues. Recently, identity membership inference attacks (IMIAs) against the CLIP model showed that membership of an individual's name and image within training data can be reliably inferred. This work formalizes the problem of identity extraction, wherein an attacker can reliably extract the names of individuals given their images only. We provide the following contributions: (i) we adapt a previous IMIA to the problem of selecting the correct name among a large set and show that the method scales to millions of names; (ii) we design an attack that outperforms the adapted baseline; (iii) we show that an attacker can extract names via optimization only. To demonstrate the interest of our framework, we show how identity extraction can be used to audit model privacy. Indeed, a family of prominent models that advertise blurring faces before training to protect privacy is still highly vulnerable to attack.
Poster
Tianyi Wang · Shuaicheng Niu · Harry Cheng · xiao zhang · Yinglong Wang

[ Exhibit Hall I ]

Abstract
Suffering from performance bottlenecks in passively detecting high-quality Deepfake images due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various …
Poster
Haiwen Huang · Anpei Chen · Volodymyr Havrylov · Andreas Geiger · Dan Zhang

[ Exhibit Hall I ]

Abstract
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks.
Poster
Ali Shah Ali · Syed Ahmed Mahmood · Mubin Saeed · Andrey Konin · Zeeshan Zia · Quoc-Huy Tran

[ Exhibit Hall I ]

Abstract
We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
Poster
Yuheng Shi · Mingjia Li · Minjing Dong · Chang Xu

[ Exhibit Hall I ]

Abstract
Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and global receptive field. However, their high computational demands limit their applicability in processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their applications in non-causal vision tasks. To address this limitation, we introduce Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which relieves the dependencies of token contribution on previous tokens. Together with the involvement of multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models.
Poster
Gen Li · Yutong Chen · Yiqian Wu · KAIFENG ZHAO · Marc Pollefeys · Siyu Tang

[ Exhibit Hall I ]

Abstract
Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer’s actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoMLVM, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose video model for egocentric understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoMLVM matches or outperforms specialist models while being an order of magnitude faster. …
Poster
Yuanhong Zheng · Ruixuan Yu · Jian Sun

[ Exhibit Hall I ]

Abstract
3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate the spatial inter-person distance embedding. With the above efficient temporal and spatial design, we achieve state-of-the-art performance for multiple metrics on the standard CMU-Mocap, MuPoTS-3D, and 3DPW datasets, while significantly reducing the computational cost.
Poster
Yan Liu · Zehao Chen · Haojie Yan · De Ma · Huajin Tang · Qian Zheng · Gang Pan

[ Exhibit Hall I ]

Abstract
Synthesizing novel space-time views from a monocular video is a highly ill-posed problem, and its effectiveness relies on accurately reconstructing the motion and appearance of the dynamic scene. Frame-based methods for novel space-time view synthesis in dynamic scenes rely on simplistic motion assumptions due to the absence of inter-frame cues, which causes them to fail under complex motion. Event cameras capture inter-frame cues with high temporal resolution, giving them promising potential to handle high-speed and complex motion. However, this remains difficult due to event noise and sparsity. To mitigate these effects, we propose E-NeMF, which alleviates the impact of event noise with a Parametric Motion Representation and mitigates event sparsity with a Flow Prediction Module. Experiments on multiple real-world datasets demonstrate our superior performance in handling high-speed and complex motion.
Poster
Yuxue Yang · Lue Fan · Zuzeng Lin · Feng Wang · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
Traditional animation production decomposes visual elements into discrete layers to enable independent processing for sketching, refining, coloring, and in-betweening. Existing anime video generation methods typically treat animation as a data domain distinct from real-world videos, lacking fine-grained control at the layer level. To bridge this gap, we introduce LayerAnimate, a novel video diffusion framework with a layer-aware architecture that empowers the manipulation of layers through layer-level controls. The development of a layer-aware framework faces a significant data scarcity challenge due to the commercial sensitivity of professional animation assets. To address this limitation, we propose a data curation pipeline featuring Automated Element Segmentation and Motion-based Hierarchical Merging. Through quantitative and qualitative comparisons and a user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an effective tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-level animation applications and creative flexibility. The code will be available upon publication.
Poster
Junhao Cheng · Yuying Ge · Yixiao Ge · Jing Liao · Ying Shan

[ Exhibit Hall I ]

Abstract
Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as ``infinite game'' since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation …
Poster
Jiaxin Lu · Chun-Hao Huang · Uttaran Bhattacharya · Qixing Huang · Yi Zhou

[ Exhibit Hall I ]

Abstract
We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO’s comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://anonymous.4open.science/w/humoto-1782/ .
Poster
Liming Jiang · Qing Yan · Yumin Jia · Zichuan Liu · Hao Kang · Xin Lu

[ Exhibit Hall I ]

Abstract
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
Poster
Dong Li · Chunhui Luo · Yuanfei Bao · Gang Yang · Jie Xiao · Xueyang Fu · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Pansharpening aims to generate high-resolution multispectral (MS) images by fusing panchromatic (PAN) images with corresponding low-resolution MS images. However, many existing methods struggle to fully capture spatial and spectral interactions, limiting their effectiveness. To address this, we propose a novel quaternion-based spatial-spectral interaction network that enhances pansharpening by leveraging the compact representation capabilities of quaternions for high-dimensional data. Our method consists of three key components: the quaternion global spectral interaction branch, the quaternion local spatial structure awareness branch, and the quaternion spatial-spectral interaction branch. The first applies the quaternion Fourier transform to convert multi-channel features into the frequency domain as a whole, enabling global information interaction while preserving inter-channel dependencies, which aids spectral fidelity. The second uses a customized spatial quaternion representation, combined with a window-shifting strategy, to maintain local spatial dependencies while promoting spatial interactions, which helps inject spatial details. The last integrates the two pathways within the quaternion framework to enrich spatial-spectral interactions for richer representations. By utilizing quaternion’s multi-dimensional representation and parameter-sharing properties, our method achieves a more compact and efficient cross-resolution, multi-band information integration, significantly improving the quality of the fused image. Extensive experiments validate the proposed method’s effectiveness and its superior performance over current SOTA …
Poster
Peng Chen · Pi Bu · Yingyao Wang · Xinyi Wang · Ziming Wang · Jie Guo · Yingxiu Zhao · Qi Zhu · Jun Song · Siran Yang · Jiamang Wang · Bo Zheng

[ Exhibit Hall I ]

Abstract
Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, model weights, training code, and framework implementation.
Poster
Pinxin Liu · Luchuan Song · Junhua Huang · Haiyang Liu · Chenliang Xu

[ Exhibit Hall I ]

Abstract
Generating full-body human gestures from speech signals still faces challenges in quality and speed. Existing approaches model different body regions such as body, legs, and hands separately, failing to capture the spatial interactions between them and resulting in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines suffer slow generation due to dozens of inference steps. To address these two challenges, we propose \textbf{GestureLSM}, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta-distribution timestamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with the improved flow-matching framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications.
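For readers unfamiliar with flow matching, the sketch below shows the basic training objective on a linear (rectified) path. It uses uniform timestep sampling and a placeholder `model`; it illustrates the general technique only, not GestureLSM's latent shortcut learning or beta-distributed timestamps.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Basic flow-matching loss on a linear noise-to-data path.

    x1: (B, T, D) batch of target gesture latents (assumed layout).
    model(x_t, t) is assumed to predict a velocity field of the same shape.
    """
    x0 = torch.randn_like(x1)               # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)        # uniform time in [0, 1] per sample
    x_t = (1 - t) * x0 + t * x1             # point on the linear path
    target_v = x1 - x0                      # constant ground-truth velocity
    pred_v = model(x_t, t.view(-1))         # predicted velocity
    return F.mse_loss(pred_v, target_v)
```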
Poster
Luca Collorone · Matteo Gioia · Massimiliano Pappa · Paolo Leoni · Giovanni Ficarra · Or Litany · Indro Spinelli · Fabio Galasso

[ Exhibit Hall I ]

Abstract
Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR significantly outperforms models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models will be made publicly available.
Poster
Kyusu Ahn · JiSoo Kim · Sangik Lee · HyunGyu Lee · Byeonghyun Ko · Chanwoo Park · Jaejin Lee

[ Exhibit Hall I ]

Abstract
Under Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare issues. Tackling such issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos remain largely unexplored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIX. Unlike existing datasets, UDC-VIX exclusively includes human motions that target facial recognition. We propose a video-capturing system to simultaneously acquire non-degraded and UDC-degraded videos of the same scene. Then, we align a pair of captured videos frame by frame, using discrete Fourier transform (DFT). We compare UDC-VIX with six representative UDC still image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIX and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do …
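DFT-based frame alignment of the kind described above is commonly done with phase correlation; the sketch below estimates an integer translation between a reference frame and a UDC-degraded frame. It is a simplified stand-in, not the dataset's actual alignment pipeline.

```python
import numpy as np

def phase_correlation_shift(ref: np.ndarray, moving: np.ndarray):
    """Estimate the (dy, dx) translation aligning `moving` to `ref` via the DFT."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moving)
    cross_power = F_ref * np.conj(F_mov)
    cross_power /= np.abs(cross_power) + 1e-8      # keep phase information only
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map wrap-around peaks to negative shifts.
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return int(dy), int(dx)
```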
Poster
Yuxuan Wang · Xuanyu Yi · Haohan Weng · Qingshan Xu · xiaokang wei · Xianghui Yang · Chunchao Guo · Long Chen · Hanwang Zhang

[ Exhibit Hall I ]

Abstract
Triangle meshes are fundamental to 3D applications. Current automatic mesh generation methods typically rely on intermediate representations that lack the continuous surface quality inherent to meshes. Converting these representations into meshes produces dense, suboptimal outputs. Although recent autoregressive approaches demonstrate promise in directly modeling mesh vertices and faces, they are constrained by the limitation in face count, scalability, and structural fidelity. To address these challenges, we propose Nautilus, a locality-aware autoencoder for artist-like mesh generation that leverages the local properties of manifold meshes to achieve structural fidelity and efficient representation. Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. Furthermore, we develop a Dual-stream Point Conditioner that captures fine-grained geometric features, ensuring global consistency and local structural fidelity. Extensive experiments demonstrate that Nautilus significantly outperforms state-of-the-art methods in generation quality.
Poster
Qiang Zhu · Yuxuan Jiang · Shuyuan Zhu · Fan Zhang · David Bull · Bing Zeng

[ Exhibit Hall I ]

Abstract
Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at https://github.com.
Poster
Lu Liu · Huiyu Duan · Qiang Hu · Liu Yang · Chunlei Cai · Tianxiao Ye · Huayu Liu · Xiaoyun Zhang · Guangtao Zhai

[ Exhibit Hall I ]

Abstract
Recent artificial intelligence (AI) generative models have demonstrated remarkable capabilities in image production, and have been widely applied to face image generation, customization, and restoration. However, many AI-generated faces (AIGFs) still suffer from issues such as unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation method for AIGFs. To this end, we introduce **FaceQ**, the first comprehensive AI-generated Face image database with fine-grained Quality annotations aligned with human preferences, which consists of 12K images and 491K ratings across multiple dimensions. Using the FaceQ database, we establish **F-Bench**, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA) methods on FaceQ, and further propose a large multimodal model (LMM) based Face quality Evaluator (**F-Eval**) to accurately assess the multi-dimensional quality of generated faces in a one-for-all manner. Extensive experimental results demonstrate the state-of-the-art performance of our F-Eval. FaceQ, F-Bench, and F-Eval will be publicly available upon publication.
Poster
Qingyu Shi · Jianzong Wu · Jinbin Bai · Lu Qi · Jiangning Zhang · Yunhai Tong · Xiangtai Li

[ Exhibit Hall I ]

Abstract
The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within a 3D U-Net. In contrast, state-of-the-art Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local similarity of motion. Our work therefore provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that …
Poster
Yusheng Dai · Chenxi Wang · Chang Li · Chen Wang · Kewei Li · Jun Du · Lei Sun · Jianqing Gao · Ruoyu Wang · Jiefeng Ma

[ Exhibit Hall I ]

Abstract
This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representations of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer length adaptation. It also adapts …
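The general idea of a bidirectional latent swap between adjacent subviews can be pictured as below. The interleaving pattern, tensor shapes, and `overlap` width are illustrative assumptions rather than the paper's exact operator; the point is that shared columns are exchanged rather than averaged.

```python
import torch

def bidirectional_latent_swap(lat_a: torch.Tensor, lat_b: torch.Tensor, overlap: int):
    """Swap alternating columns of the overlapping region between two adjacent views.

    lat_a, lat_b: (C, H, W) latents of adjacent subviews; the last/first `overlap`
    columns (time frames for spectrogram latents) are assumed to be shared.
    """
    a_tail = lat_a[..., -overlap:].clone()
    b_head = lat_b[..., :overlap].clone()
    # Interleave instead of averaging so high-frequency content from each
    # denoising trajectory survives in the shared region.
    lat_a[..., -overlap::2] = b_head[..., ::2]
    lat_b[..., :overlap:2] = a_tail[..., ::2]
    return lat_a, lat_b
```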
Poster
Heyan Liu · Jianing Sun · Jun Liu · Xi-Le Zhao · Tingting WU · Tieyong Zeng

[ Exhibit Hall I ]

Abstract
Blind deblurring is an ill-posed inverse problem that involves recovering both the clear image and the blur kernel from a single blurry image. In real photography, longer exposure times result in lots of noise in the blurry image. Although existing blind deblurring methods produce satisfactory results on blurred images with little or no noise, they struggle to handle high noise levels. Strong noise compromises the accuracy of the estimated kernel and significantly reduces the quality of the deblurring results. To address this challenge, we propose a Residual Guidance Strategy (RGS) to suppress the influence of noise. Our method leverages adjacent coarser-scale information in the image pyramid to guide the blur kernel estimation in the current scale. Therefore, for blurred images with unknown noise levels and types, our method still estimates more accurate blur kernels, which are essential for subsequent non-blind restoration. Extensive experiments on both synthetic and real datasets have demonstrated that our method consistently outperforms numerous state-of-the-art methods under high levels of noise quantitatively and qualitatively.
Poster
WENXUAN WU · ruowen qu · Zhongliang Liu · Zhuoyan Dai · Dongzi Shi · Sijin Yu · Tong Xiong · Shiping Liu · Xiangmin Xu · Xiaofen Xing · Xin Zhang

[ Exhibit Hall I ]

Abstract
Diffeomorphic-based cortical surface reconstruction typically involves a series of deformation processes to extract the cerebral cortex from brain magnetic resonance images (MRI). While most methods are designed for adult brains using Neural Ordinary Differential Equations (NODE) with fixed step sizes, the neonatal brain, which exhibits dramatic changes in cortical folding patterns early in life, requires a more adaptive approach. To address this, we develop a dual-task framework to directly characterize the brain development trajectory through processes of cortical surface reconstruction. For white matter (inner surfaces), we employ an Age-Conditioned ODE with adaptive step sizes. It is initially trained on a limited set of longitudinal paired data to establish a coarse trajectory, which is then refined through sample training of single-point data and knowledge distillation. For the pial surfaces (outer surfaces), we position the midthickness surfaces as intermediates and employ a cycle-consistent semi-supervised training strategy to depict a coherent brain development trajectory between the inner and outer surfaces. Our approach is the first to achieve precise developmental prediction directly on triangular meshes. Furthermore, by enhancing interpretability at each stage of the deformation process, this approach improves the applicability of diffeomorphic-based methods. The proposed method has demonstrated state-of-the-art performance in modeling developmental …
Poster
Yuxuan Luo · Zhengkun Rong · Lizhen Wang · Longhao Zhang · Tianshu Hu

[ Exhibit Hall I ]

Abstract
While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which leads to their lower expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, HERA, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms the state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency.
Poster
Yudong Jin · Sida Peng · Xuan Wang · Tao Xie · Zhen Xu · Yifan Yang · Yujun Shen · Hujun Bao · Xiaowei Zhou

[ Exhibit Hall I ]

Abstract
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and outperforms the existing methods by a large margin. Our code and dataset will be released.
Poster
Jiangran Lyu · Ziming Li · Xuesong Shi · Chaoyi Xu · Yizhou Wang · He Wang

[ Exhibit Hall I ]

Abstract
Nonprehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they heavily rely on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5\% using only single-view point cloud observations in the simulation. Furthermore, DyWA achieves an average success rate of 68\% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.
Poster
Jinseok Bae · Inwoo Hwang · Young-Yoon Lee · Ziyu Guo · Joseph Liu · Yizhak Ben-Shabat · Young Min Kim · Mubbasir Kapadia

[ Exhibit Hall I ]

Abstract
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
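The keyframe masking and in-betweening idea can be sketched with a simple stride-based mask and linear interpolation. The stride, tensor layout, and interpolation scheme are assumptions for illustration; the paper's framework learns the in-betweening rather than interpolating linearly.

```python
import torch

def keyframe_mask(num_frames: int, stride: int = 8) -> torch.Tensor:
    """Mark every `stride`-th frame (and the last frame) as a keyframe."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[::stride] = True
    mask[-1] = True
    return mask

def linear_inbetween(key_poses: torch.Tensor, key_idx: torch.Tensor,
                     num_frames: int) -> torch.Tensor:
    """Fill non-keyframes by linear interpolation between neighbouring keyframes.

    key_poses: (K, D) poses at sorted keyframe indices key_idx (K,).
    Assumes key_idx[0] == 0 and key_idx[-1] == num_frames - 1.
    """
    out = torch.empty(num_frames, key_poses.size(1))
    for i in range(len(key_idx) - 1):
        a, b = int(key_idx[i]), int(key_idx[i + 1])
        w = torch.linspace(0, 1, b - a + 1).unsqueeze(1)   # (span+1, 1) blend weights
        out[a:b + 1] = (1 - w) * key_poses[i] + w * key_poses[i + 1]
    return out
```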
Poster
Xiaoxi Liang · Yanbo Fan · Qiya Yang · Xuan Wang · Wei Gao · Ge Li

[ Exhibit Hall I ]

Abstract
In this work, we investigate the generation of high-fidelity, audio-driven 3D Gaussian talking heads from monocular videos. We present DGTalker, an innovative framework designed for real-time, high-fidelity, and 3D-aware talking head synthesis. By leveraging Gaussian generative priors and treating the task as a latent space navigation problem, our method effectively alleviates the lack of 3D information and the low-quality detail reconstruction caused by overfitting to training views in monocular videos, which has been a longstanding challenge in existing 3DGS-based approaches. To ensure precise lip synchronization and nuanced expression control, we propose a disentangled latent space navigation framework that independently models lip motion and upper-face expressions. Additionally, we introduce an effective masked cross-view supervision strategy to enable robust learning within the disentangled latent space. We conduct extensive experiments and demonstrate that DGTalker surpasses current state-of-the-art methods in visual quality, motion accuracy, and controllability.
Poster
Yating Wang · Haoyi Zhu · Mingyu Liu · Jiange Yang · Hao-Shu Fang · Tong He

[ Exhibit Hall I ]

Abstract
In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30\% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more …
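At its core, a vector-quantization action tokenizer reduces to a nearest-codebook lookup; the sketch below shows encoding and decoding under assumed shapes and a random codebook, and is not the released tokenizer.

```python
import torch

def quantize_actions(actions: torch.Tensor, codebook: torch.Tensor):
    """Map continuous action chunks to discrete tokens and back.

    actions:  (N, D) flattened action chunks (assumed layout).
    codebook: (K, D) learned code vectors.
    Returns (tokens, reconstruction): (N,) int64 ids and (N, D) decoded actions.
    """
    dists = torch.cdist(actions, codebook)   # (N, K) pairwise L2 distances
    tokens = dists.argmin(dim=1)             # nearest code per chunk
    recon = codebook[tokens]                 # decoding is a table lookup
    return tokens, recon

# Toy usage with a random codebook of 512 codes for 7-D action chunks.
codebook = torch.randn(512, 7)
tokens, recon = quantize_actions(torch.randn(32, 7), codebook)
```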
Poster
Zhiqi Pang · Chunyu Wang · Lingling Zhao · Junjie Wang

[ Exhibit Hall I ]

Abstract
Color variations, a key challenge in the unsupervised visible-infrared person re-identification (UVI-ReID) task, have garnered significant attention. While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model’s robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, in this paper, we focus on improving the robustness of pseudo-labels to color variations through data augmentation and propose an augmented and softened matching (ASM) method. Specifically, we first develop the cross-modality augmented matching (CAM) module, which performs channel augmentation on visible images to generate augmented images. Then, based on the fusion of the visible-infrared and augmented-infrared centroid similarity matrices, CAM establishes cross-modality correspondences that are robust to color variations. To increase training stability, we design a soft-labels momentum update (SMU) strategy, which converts traditional one-hot labels into soft-labels through momentum updates, thus adapting to CAM. During the optimization phase, we introduce the cross-modality soft contrastive loss and cross-modality hard contrastive loss to promote modality-invariant learning from the perspectives of shared and diversified features, respectively. Extensive experimental results validate the effectiveness of the proposed method, showing that ASM not only outperforms state-of-the-art unsupervised methods but also competes …
Poster
Qiangqiang Wu · Yi Yu · Chenqi Kong · Ziquan Liu · Jia Wan · Haoliang Li · Alex Kot · Antoni Chan

[ Exhibit Hall I ]

Abstract
With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification), and directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable to large-scale video datasets. Trackers trained with TUEs rely heavily on unlearnable noises for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal …
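For context, a temporal contrastive objective typically takes an InfoNCE form over temporally matched features. The sketch below shows that generic form only, under assumed shapes; it is not the paper's specific loss, which is designed to corrupt tracker learning when training on TUEs.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(feat_t: torch.Tensor, feat_t1: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE over temporally matched features.

    feat_t, feat_t1: (B, D) embeddings of B tracked regions at frames t and t+1.
    Row i of both tensors is the positive pair; other rows act as negatives.
    """
    z1 = F.normalize(feat_t, dim=1)
    z2 = F.normalize(feat_t1, dim=1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```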
Poster
Wenhan Wu · Zhishuai Guo · Chen Chen · Hongfei Xue · Aidong Lu

[ Exhibit Hall I ]

Abstract
Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, thereby improving zero-shot action recognition.
Poster
Jiacheng Li · Feiran Li · Daisuke Iso

[ Exhibit Hall I ]

Abstract
In recent years, neural networks have achieved significant progress in offline image processing. However, in online scenarios, particularly in on-chip implementations, memory usage emerges as a critical bottleneck due to the limited memory resources of integrated image processors. In this study, we focus on reducing the memory footprint of neural networks for on-chip image processing by optimizing network design for efficient memory utilization. Specifically, we consider a typical scenario in which images output from an image sensor are processed sequentially using line buffers in a line-by-line manner. This setting necessitates the modeling of both intra-line and inter-line correlations—capturing dependencies among pixels within a single line group and across different line groups, respectively. To model intra-line correlations, we propose a progressive feature enhancement strategy, where line pixels are processed with expanding strip convolutions in multiple stages. For inter-line correlation modeling, we introduce a hierarchical line buffer formulation, where features extracted from previous lines are incrementally reused and compressed across multiple hierarchical levels. Comprehensive experiments on various image processing tasks, including RAW denoising, Gaussian denoising, and super-resolution, demonstrate that the proposed method achieves a better trade-off between performance and memory efficiency than previous solutions, e.g., up to 1dB PSNR gain in RAW denoising at one-fifth …
Poster
shiduo zhang · Zhe Xu · Peiju Liu · Xiaopeng Yu · Qinghui Gao · Yuan Li · Zhaoye Fei · Zhangyue Yin · Zuxuan Wu · Yu-Gang Jiang · Xipeng Qiu

[ Exhibit Hall I ]

Abstract
General-purpose embodied agents are designed to understand users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh\&texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that …
Poster
Yibing Wei · Samuel Church · Victor Suciu · Jinhong Lin · Cheng-En Wu · Pedro Morgado

[ Exhibit Hall I ]

Abstract
Video data inherently captures rich, dynamic contexts that reveal objects in varying poses, interactions, and state transitions, offering strong potential for unsupervised visual representation learning. However, existing natural video datasets are not well-suited for effective object representation learning due to their lack of object-centricity and class diversity. To address these challenges, we introduce TrackVerse, a novel large-scale video dataset for learning object representations. TrackVerse features diverse, common objects tracked over time, capturing their evolving states. To leverage temporal dynamics in TrackVerse, we extend contrastive learning with a variance-aware predictor that conditions on data augmentations, enabling models to learn state-aware representations. Extensive experiments demonstrate that representations learned from TrackVerse with variance-aware contrastive learning significantly outperform those from non-object-centric natural video and static image datasets across multiple downstream tasks, including object/attribute recognition, action recognition, and video instance segmentation, highlighting the rich semantic and state content in TrackVerse features.
Poster
Hyeonho Jeong · Suhyeon Lee · Jong Ye

[ Exhibit Hall I ]

Abstract
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Anonymous project page: https://anony1anony2.github.io/
Poster
Zhen Wu · Jiaman Li · Pei Xu · Karen Liu

[ Exhibit Hall I ]

Abstract
Intelligent agents must autonomously interact with the environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its significant potential for real-world applications.
Poster
Pengfei Zhang · Pinxin Liu · Pablo Garrido · Hyeongwoo Kim · Bindita Chaudhuri

[ Exhibit Hall I ]

Abstract
Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as "run", fails to capture essential details like variations in speed, limb positioning, and kinematic dynamics, leading to significant ambiguities between text and motion modalities. To address this challenge, we introduce \textbf{KinMo}, a unified framework built on a hierarchical describable motion representation that extends beyond global action by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment, improving spatial understanding by integrating additional motion details. Furthermore, we introduce a coarse-to-fine generation procedure to demonstrate how enhanced spatial understanding benefits motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance and enabling more fine-grained motion generation and editing capabilities.
Poster
Taehoon Kim · Jongwook Choi · Yonghyun Jeong · Haeun Noh · Jaejun Yoo · Seungryul Baek · Jongwon Choi

[ Exhibit Hall I ]

Abstract
We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. The traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect pixel-wise temporal artifacts. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
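The per-pixel temporal frequency feature described above can be illustrated with a few lines of NumPy; the clip shape, the DC removal, and the magnitude spectrum below are assumptions of this sketch, and the paper's attention proposal and joint transformer modules are not reproduced.

```python
import numpy as np

def pixelwise_temporal_spectrum(clip):
    """clip: (T, H, W) grayscale frames in [0, 1].

    Returns a (T // 2 + 1, H, W) magnitude spectrum per pixel, obtained with a
    real FFT along the time axis. Unnatural pixel-wise flicker tends to show up
    as extra high-frequency energy in this representation.
    """
    clip = clip - clip.mean(axis=0, keepdims=True)   # remove the per-pixel DC component
    spec = np.fft.rfft(clip, axis=0)                 # 1D FFT over time, independently per pixel
    return np.abs(spec)

# Toy usage on a random clip standing in for decoded video frames.
clip = np.random.rand(32, 64, 64)
feat = pixelwise_temporal_spectrum(clip)
print(feat.shape)  # (17, 64, 64)
```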
Poster
Lei-lei Li · Jianwu Fang · Junbin Xiao · Shanmin Pang · Hongkai Yu · Chen Lv · Jianru Xue · Tat-Seng Chua

[ Exhibit Hall I ]

Abstract
Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct DriveGaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video content editing, normal-to-accident video diffusion, and text-to-video generation.
Poster
Haodong Zhu · Wenhao Dong · Linlin Yang · Hong Li · Yuguang Yang · Yangyang Ren · Qingcheng Zhu · Zichao Feng · CHANGBI LI · Shaohui Lin · Runqi Wang · Xiaoyan Luo · Baochang Zhang

[ Exhibit Hall I ]

Abstract
Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of the WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an ``absolute maximum'' fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of $4.5$\% on four benchmarks.
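The wavelet decomposition and the ``absolute maximum'' rule for high-frequency sub-bands can be shown directly with PyWavelets; the simple averaging of the low-frequency approximations below is only a placeholder for the paper's Mamba-based LMFB, and the single-channel inputs are an assumption of this sketch.

```python
import numpy as np
import pywt

def dwt_abs_max_fusion(rgb_gray, ir, wavelet="haar"):
    """Fuse two registered single-channel images in the wavelet domain (toy sketch).

    High-frequency sub-bands (cH, cV, cD) are fused with an absolute-maximum
    rule; the low-frequency approximations are simply averaged here as a
    stand-in for the learned low-frequency fusion described in the abstract.
    """
    cA1, (cH1, cV1, cD1) = pywt.dwt2(rgb_gray, wavelet)
    cA2, (cH2, cV2, cD2) = pywt.dwt2(ir, wavelet)

    def fuse_hi(a, b):
        # Keep whichever coefficient has the larger magnitude.
        return np.where(np.abs(a) >= np.abs(b), a, b)

    fused = (0.5 * (cA1 + cA2),
             (fuse_hi(cH1, cH2), fuse_hi(cV1, cV2), fuse_hi(cD1, cD2)))
    return pywt.idwt2(fused, wavelet)

# Toy usage with random images standing in for a registered RGB/IR pair.
rgb_gray = np.random.rand(128, 128)
ir = np.random.rand(128, 128)
print(dwt_abs_max_fusion(rgb_gray, ir).shape)  # (128, 128)
```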
Poster
Qing Ma · Pengwei Liang · Xiong Zhou · Jiayi Ma · Junjun Jiang · Zhe Peng

[ Exhibit Hall I ]

Abstract
Gaussian denoising often serves as the starting point of research in the field of image denoising, owing to its prevalence and intriguing properties. However, deep Gaussian denoisers typically generalize poorly to other types of noise, such as Poisson noise and real-world noise. In this paper, we reveal that deep Gaussian denoisers have an underlying ability to handle other noises with only ten iterations of self-supervised learning, which is referred to as \textit{deep denoiser prior}. Specifically, we first pre-train a Gaussian denoising model in a self-supervised manner. Then, for each test image, we construct a pixel bank based on self-similarity and randomly sample pseudo-instance examples from it to perform test-time adaptation. Finally, we fine-tune the pre-trained Gaussian denoiser using the randomly sampled pseudo-instances. Extensive experiments demonstrate that our test-time adaptation method helps the pre-trained Gaussian denoiser rapidly improve performance in removing both in-distribution and out-of-distribution noise, achieving superior performance compared to existing single-image denoising methods while also significantly reducing computational time.
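A generic version of the described test-time adaptation loop is sketched below in PyTorch. The pixel-bank construction and pseudo-instance sampling are abstracted into a callable, and the optimizer, learning rate, and loss are assumptions rather than the paper's exact recipe.

```python
import torch

def test_time_adapt(denoiser, noisy, sample_pseudo_pair, steps=10, lr=1e-4):
    """Generic single-image test-time adaptation loop (a sketch, not the paper's method).

    denoiser:           a pre-trained Gaussian denoising network (torch.nn.Module)
    noisy:              the single test image, shape (1, C, H, W)
    sample_pseudo_pair: callable returning an (input, target) pseudo-instance drawn
                        from a self-similarity pixel bank; how that bank is built
                        is left abstract here.
    """
    opt = torch.optim.Adam(denoiser.parameters(), lr=lr)
    denoiser.train()
    for _ in range(steps):                      # the abstract reports ~10 iterations
        inp, tgt = sample_pseudo_pair(noisy)
        loss = torch.nn.functional.mse_loss(denoiser(inp), tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    denoiser.eval()
    with torch.no_grad():
        return denoiser(noisy)                  # denoise the test image with the adapted model
```

The `sample_pseudo_pair` callable is where a pixel-bank sampler would plug in; any concrete implementation of it is beyond this sketch.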
Poster
Hyung Kyu Kim · Sangmin Lee · HAK GU KIM

[ Exhibit Hall I ]

Abstract
Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker’s speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose \textit{MemoryTalker}, which enables realistic and accurate 3D facial motion synthesis by reflecting speaker style only with audio input to maximize usability in applications. Our framework consists of two training stages: $<$1-stage$>$ stores and retrieves general motion (\textit{i.e.}, Memorizing), and $<$2-stage$>$ performs personalized facial motion synthesis (\textit{i.e.}, Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our \textit{MemoryTalker} can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods. Our source code will be released to facilitate further research.
Poster
Haochen Chang · Pengfei Ren · Haoyang Zhang · Liang Xie · Hongbo Chen · Erwei Yin

[ Exhibit Hall I ]

Abstract
In recent years, skeleton-based action recognition has gained significant attention due to its robustness in varying environmental conditions. However, most existing methods struggle to distinguish fine-grained actions because of subtle motion features and minimal inter-class variation, and they often fail to consider the underlying similarity relationships between action classes. To address these limitations, we propose a Hierarchical-aware Orthogonal Disentanglement framework (HiOD). We disentangle coarse-grained and fine-grained features by employing independent spatial-temporal granularity-aware bases, which encode semantic representations at varying levels of granularity. Additionally, we design a cross-granularity feature interaction mechanism that leverages complementary information between coarse-grained and fine-grained features. We further enhance the learning process through hierarchical prototype contrastive learning, which utilizes the parent class hierarchy to guide the learning of coarse-grained features while ensuring the distinguishability of fine-grained features within child classes. Extensive experiments on FineGYM, FSD-10, NTU RGB+D, and NTU RGB+D 120 datasets demonstrate the superiority of our method in fine-grained action recognition tasks.
Poster
Xingyu Hu · Junjun Jiang · Chenyang Wang · Kui Jiang · Xianming Liu · Jiayi Ma

[ Exhibit Hall I ]

Abstract
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model's generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named "TITA", which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks.
Poster
Jong Hyeon Baek · Jiwon oh · Yeong Jun Koh

[ Exhibit Hall I ]

Abstract
Video Object Segmentation (VOS) in low-light scenarios remains highly challenging due to significant texture loss and severe noise, which often lead to unreliable image feature generation and degraded segmentation performance. To address this issue, we propose EVOLVE, a novel event-guided deformable feature transfer and dual-memory refinement framework for low-light VOS. EVOLVE addresses spatial misalignment between frames and improves object representation by utilizing event-driven cues. The event-guided deformable feature transfer (EDFT) module enhances feature alignment through event-driven deformable convolutions, where offsets derived from event features enable motion-aware spatial adjustments, leading to more precise propagation of object features in reference frames. Furthermore, the dual-memory object transformer (DMOT) iteratively refines object representations by maintaining and updating both image-based and event-based memory representations. Through its memory refinement module (MRM), DMOT selectively enhances relevant object features while suppressing background noise, resulting in stable and temporally coherent segmentation results. Extensive experiments on low-light VOS benchmarks demonstrate that EVOLVE achieves state-of-the-art segmentation performance, surpassing both event-based and image-based VOS methods in accuracy and computational efficiency.
Poster
Yong Liu · Hang Dong · Jinshan Pan · Qingji dong · Kai Chen · Rongxiang Zhang · Lean Fu · Fei Wang

[ Exhibit Hall I ]

Abstract
While diffusion models significantly improve the perceptual quality of super-resolved images, they usually require a large number of sampling steps, resulting in high computational costs and long inference times. Recent efforts have explored reasonable acceleration schemes by reducing the number of sampling steps. However, these approaches treat all regions of the image equally, overlooking the fact that regions with varying levels of reconstruction difficulty require different sampling steps. To address this limitation, we propose PatchScaler, an efficient patch-independent diffusion pipeline for single image super-resolution. Specifically, PatchScaler introduces a Patch-adaptive Group Sampling (PGS) strategy that groups feature patches by quantifying their reconstruction difficulty and establishes shortcut paths with different sampling configurations for each group. To further optimize the patch-level reconstruction process of PGS, we propose a texture prompt that provides rich texture conditional information to the diffusion model. The texture prompt adaptively retrieves texture priors for the target patch from a common reference texture memory. Extensive experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations, while significantly speeding up inference.
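As a rough illustration of patch-adaptive grouping, the snippet below scores patches with a gradient-energy proxy and assigns each group a different sampling-step budget; both the proxy and the tercile split are hypothetical stand-ins for the difficulty measure learned in PGS, and the budgets are arbitrary.

```python
import numpy as np

def group_patches_by_difficulty(patches, step_budgets=(5, 10, 20)):
    """Assign each patch a sampling-step budget from a crude difficulty proxy.

    patches: (N, h, w) low-resolution patches. Gradient energy and the tercile
    split below are illustrative stand-ins for the learned difficulty measure
    used in Patch-adaptive Group Sampling.
    """
    gy, gx = np.gradient(patches.astype(np.float64), axis=(1, 2))
    difficulty = (gx ** 2 + gy ** 2).mean(axis=(1, 2))   # (N,) texture-richness proxy
    cuts = np.quantile(difficulty, [1 / 3, 2 / 3])       # split into three groups
    group = np.digitize(difficulty, cuts)                # 0 (easy), 1, or 2 (hard)
    return np.asarray(step_budgets)[group]               # steps allotted per patch

# Toy usage: 16 random patches get 5, 10, or 20 diffusion steps each.
patches = np.random.rand(16, 32, 32)
print(group_patches_by_difficulty(patches))
```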
Poster
Jinshu Chen · Bingchuan Li · Fan Zhang · Songtao Zhao · Qian HE

[ Exhibit Hall I ]

Abstract
Existing solutions for creating high-fidelity digital head avatars encounter various obstacles. Traditional rendering tools offer realistic results, while heavily requiring expert skills. Neural rendering methods are more efficient but often compromise between the generated fidelity and flexibility. We present OneGT that, for the first time, adheres to the frameworks of the rendering tools, while restructuring individual stages of the rendering pipeline through neural networks. OneGT maintains high systemic interpretability, inheriting the superior performances of neural rendering approaches. Specifically, OneGT contains a skeleton-anchoring stage and a texture-rendering stage, in which well-designed Transformers learn the geometric transformations and the proposed reference-perceptible DiT renders the textures respectively. Our framework learns geometric consistency from the innovatively introduced synthetic data, thus achieving superior performance while requiring only 10%-30% of the real-world data typically used by competitive methods. Experimental results demonstrate that OneGT achieves high fidelity in producing portrait avatars, meanwhile maintaining the flexibility of editing.
Poster
Xinyao Liao · Xianfang Zeng · Liao Wang · Gang YU · Guosheng Lin · Chi Zhang

[ Exhibit Hall I ]

Abstract
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text, and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. After that, an optional rethinking step can be adopted to ensure the generated video is aligned well with motion information in the prompt. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We further construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
Poster
Wenjie Pei · Qizhong Tan · Guangming Lu · Jiandong Tian · Jun Yu

[ Exhibit Hall I ]

Abstract
Adapting pre-trained image models to video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning in image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter). It features a lightweight design, low adaptation overhead and powerful spatio-temporal feature adaptation capabilities. D$^2$ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features within the adapter, seamlessly integrating into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of the D$^2$ST-Adapter, incorporating the specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be individually tailored for two pathways with anisotropic sampling densities along the spatial and temporal domains in 3D spatio-temporal space, enabling disentangled encoding of spatial and temporal features while maintaining a lightweight design. Extensive experiments by instantiating our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition. Code will be released.
Poster
Weitian Zhang · Yichao Yan · Sijing Wu · Manwen Liao · Xiaokang Yang

[ Exhibit Hall I ]

Abstract
Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. While existing methods have made progress in creating animatable digital avatars, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, a novel feed-forward diffusion-based method capable of generating high-quality component-disentangled clothed avatars in seconds. We propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation can be effectively learned with current feed-forward generation pipelines, facilitating component disentanglement and enhancing details of generated avatars. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to mitigate the severe occlusion issue of the innermost human body layer. Extensive experiments demonstrate the superior performance of our method in generating highly detailed and disentangled clothed avatars. In addition, we explore its applications in component transfer.
Poster
Jorge Herrera · Yi Zhou · Xin Sun · Zhixin Shu · Chengan He · Soren Pirk · Dominik Michels

[ Exhibit Hall I ]

Abstract
We propose a novel Augmented Mass-Spring (AMS) model for real-time simulation of dense hair at the strand level. Our approach considers the traditional edge, bending, and torsional degrees of freedom in mass-spring systems, but incorporates an additional one-way biphasic coupling with a ghost rest-shape configuration. Through multiple evaluation experiments with varied dynamical settings, we show that AMS improves the stability of the simulation in comparison to mass-spring discretizations, preserves global features, and enables the simulation of non-Hookean effects. Using a heptadiagonal decomposition of the resulting matrix, our approach provides the efficiency advantages of mass-spring systems over more complex constitutive hair models, while enabling a more robust simulation of multiple strand configurations. Finally, our results demonstrate that our framework enables the generation, complex interactivity, and editing of simulation-ready dense hair assets in real time.
Poster
Raiyaan Abdullah · Jared Claypoole · Michael Cogswell · Ajay Divakaran · Yogesh Rawat

[ Exhibit Hall I ]

Abstract
Action recognition models, both unimodal and multimodal, have demonstrated strong generalization in tasks such as zero-shot learning, base-to-novel transfer, and domain adaptation. However, can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "Pushing" when presented with unknown variations such as "Pushing something from right to left"? To explore this, we introduce a motion transferability framework with three datasets: (1) **Syn-TA**, a synthetic dataset with 3D object motions; (2) **Kinetics400-TA**; and (3) **Something-Something-v2-TA**, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than coarse ones; 2) The bias-free **Syn-TA** proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. Our study establishes a crucial benchmark for assessing motion transferability in action …
Poster
YeonJi Song · Jaein Kim · Suhyung Choi · Jin-Hwa Kim · Byoung-Tak Zhang

[ Exhibit Hall I ]

Abstract
Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its potential for applicability in vision-related dynamics learning tasks.
Poster
Kartik Narayan · Vibashan VS · Rama Chellappa · Vishal Patel

[ Exhibit Hall I ]

Abstract
In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. These tasks include face parsing, landmark detection, head pose estimation, attribute prediction, age, gender, and race estimation, facial expression recognition, face recognition, and face visibility. Traditional face analysis approaches rely on task-specific architectures and pre-processing techniques, limiting scalability and integration. In contrast, FaceXFormer employs a transformer-based encoder-decoder architecture, where each task is represented as a learnable token, enabling seamless multi-task processing within a unified model. To enhance efficiency, we introduce FaceX, a lightweight decoder with a novel bi-directional cross-attention mechanism, which jointly processes face and task tokens to learn robust and generalized facial representations. We train FaceXFormer on ten diverse face perception datasets and evaluate it against both specialized and multi-task models across multiple benchmarks, demonstrating state-of-the-art or competitive performance. Additionally, we analyze the impact of various components of FaceXFormer on performance, assess real-world robustness in "in-the-wild" settings, and conduct a computational performance evaluation. To the best of our knowledge, FaceXFormer is the first model capable of handling ten facial analysis tasks while maintaining real-time performance at $33.21$ FPS. Code and models will be released post-review.
Poster
minjung kim · Minsang Kim · Seung Jun Baek

[ Exhibit Hall I ]

Abstract
The task of generating 3D facial expressions given various situational contexts is important for applications such as virtual avatars or human-robot interactions. The task is, however, challenging not only because it requires a comprehensive understanding of emotion, expression and contexts, but also there rarely are datasets to support the task. We propose ContextFace, a Multi-modal Large Language Model (MLLM) fine-tuned to generate 3D facial expressions depending on complex situational contexts. To overcome the lack of datasets, we perform a context augmentation to existing emotion recognition datasets; we generate plausible situations and quotes from images and emotions to annotate the dataset. Next, we perform visual instruction tuning of MLLMs on context-augmented datasets to boost its capability of visual synthesis from emotions. Experiments show a superior performance of ContextFace in the zero-shot evaluation of contextual emotion recognition. A qualitative evaluation shows that our method generates expressions consistent with diverse contexts and performs complex emotion reasoning, e.g., speculative generation of expressions of occluded faces through interactive prompting.
Poster
honghui xu · Chuangjie Fang · Yibin Wang · Jie Wu · Jianwei Zheng

[ Exhibit Hall I ]

Abstract
Deep unfolding network (DUN)-based pansharpening has shed new light on high-resolution/spectrum image acquisition, serving as a computational alternative to physical devices. While enjoying both the merits of deep feature learning and acceptable interpretability, current pansharpening methods require substantial effort to approximate the degradation matrices along the spatial and spectral dimensions, yet their performance is hardly guaranteed in complex scenarios. Moreover, as a key step during the DUN update, current solutions rely solely on black-box networks to learn the data-driven priors, which further results in laborious architecture crafting and compromised interpretability. To counteract these dilemmas, we propose a new solution, namely the \textbf{R}PCA-based \textbf{U}nfolding \textbf{N}etwork (RUN), which shrinks the original two degradations to only one. Specifically, grounded in the significant sparsity of spatial offset components, \textit{i.e.}, the difference between the upsampled image and the desired target, we recast the original pansharpening problem into a novel Robust Principal Component Analysis (RPCA)-based paradigm. On that basis, the tricky approximation to the spatial degradation matrix as well as its transposed counterpart is naturally avoided. Specifically, for the prior learning step of RPCA unfolding, an efficient Nonlinear transformation-based Tensor Nuclear Norm (NTNN) is meticulously engineered, in which the computationally intensive Singular Value Decomposition is avoided with the aid …
Poster
Jin Hu · Mingjia Li · Xiaojie Guo

[ Exhibit Hall I ]

Abstract
Shadows introduce challenges such as reduced brightness, texture deterioration, and color distortion in images, complicating a holistic solution. This study presents ShadowHack, a divide-and-conquer strategy that tackles these complexities by decomposing the original task into luminance recovery and color remedy. To brighten shadow regions and repair the corrupted textures in the luminance space, we customize LRNet, a U-shaped network with a rectified outreach attention module, to enhance information interaction and recalibrate contaminated attention maps. With luminance recovered, CRNet then leverages cross-attention mechanisms to revive vibrant colors, producing visually compelling results. Extensive experiments on multiple datasets are conducted to demonstrate the superiority of ShadowHack over existing state-of-the-art solutions both quantitatively and qualitatively, highlighting the effectiveness of our design. Our code will be made publicly available.
Poster
Tingwei Li · Jun Bao · Zhenzhong Kuang · Buyu Liu

[ Exhibit Hall I ]

Abstract
This work focuses on unsupervised 3D gaze estimation. Specifically, we adopt a learning-by-synthesis approach, where a gaze prediction model is trained using simulated data. Unlike existing methods that lack explicit and accurate control over facial images—particularly the eye regions—we propose a geometrically meaningful 3D representation that enables diverse, precise, and explicit control over illumination, eye regions, and gaze targets using only facial images. Given a sequence of facial images, our method constructs a mesh representation where each mesh is associated with 3D Gaussians, allowing for explicit lighting control. To further enhance realism, we introduce eye-focused constraints, including a rotation symmetry protocol, as well as geometry and appearance losses for the eye regions, alongside conventional learning objectives. Additionally, we incorporate a virtual screen target and rotate the eyeballs accordingly, generating more accurate pseudo gaze directions paired with realistic facial images. We validate our approach through extensive experiments on three benchmarks. The results demonstrate that gaze estimators trained using our method outperform all unsupervised baselines and achieve performance comparable to cross-dataset approaches. Furthermore, our method generates the most visually realistic images, as confirmed by both objective and subjective image quality metrics.
Poster
Hae Jin Song · Laurent Itti

[ Exhibit Hall I ]

Abstract
Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training (``regurgitative training''), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of generative models using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al, 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and $k$NN-based Riemannian center of mass. We apply our theory to …
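For context, the Riemannian center of mass referred to above is the standard Karcher/Fréchet mean; a generic statement, with notation of our choosing rather than the paper's, is:

```latex
% Riemannian center of mass (Karcher mean) of samples x_1, ..., x_n on a manifold
% (M, g), with d_g the geodesic distance induced by the (here, learned) metric g:
\bar{x} \;=\; \arg\min_{x \in M} \; \frac{1}{n} \sum_{i=1}^{n} d_g(x, x_i)^2 .
% A kNN-based variant restricts the sum to the k geodesic nearest neighbors of a
% query point, mirroring the Euclidean kNN construction it replaces.
```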
Poster
Yingyu Liang · Zhizhou Sha · Zhenmei Shi · Zhao Song · Mingda Wan · Yufa Zhou

[ Exhibit Hall I ]

Abstract
Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. We prove that if the target distribution is a $k$-mixture of Gaussians, the density of the entire diffusion process will also be a $k$-mixture of Gaussians. We then derive tight upper bounds on the Lipschitz constant and second momentum that are independent of the number of mixture components $k$. Finally, we apply our analysis to various diffusion solvers, both SDE and ODE based, to establish concrete error guarantees in terms of the total variation distance and KL divergence between the target and learned distributions. Furthermore, our preliminary experiments support our theoretical analysis. Our results provide deeper theoretical insights into the dynamics of the diffusion process under common data distributions.
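The closure property stated in the abstract can be written out for the standard variance-preserving forward process; the schedule notation below is a common convention used here only for illustration, not copied from the paper.

```latex
% Variance-preserving forward process with schedule \alpha_t and \sigma_t^2 = 1 - \alpha_t^2,
% so that x_t \mid x_0 \sim \mathcal{N}(\alpha_t x_0, \sigma_t^2 I).
% If the data distribution is a k-mixture of Gaussians,
p_0(x) \;=\; \sum_{i=1}^{k} w_i \,\mathcal{N}(x \mid \mu_i, \Sigma_i),
% then marginalizing over x_0 preserves the mixture structure at every time t:
p_t(x) \;=\; \sum_{i=1}^{k} w_i \,\mathcal{N}\!\left(x \mid \alpha_t \mu_i,\; \alpha_t^2 \Sigma_i + \sigma_t^2 I\right).
```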
Poster
Juntao Jian · Xiuping Liu · Zixuanchen Zixuanchen · Manyi Li · Jian Liu · Ruizhen Hu

[ Exhibit Hall I ]

Abstract
Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. But it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution plays as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate the remarkable performance against the existing approaches.
Poster
Hui Li · Xiaoyu Ren · Hongjiu Yu · Ying Chen · Kai Li · L Wang · Xiongkuo Min · Huiyu Duan · Guangtao Zhai · Xu Liu

[ Exhibit Hall I ]

Abstract
Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live videos with facial retouching. However, previous FAP datasets are either small or closed-source. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, we introduce the first large-scale FAP dataset, LiveBeauty, specifically designed for live video scenarios wherein face images may be processed in real time for aesthetic purposes. 10,000 face images are collected directly from a live streaming platform, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset. Based on the built dataset, a novel FAP method named Facial Prior Enhanced Multi-modal model (FPEM) is proposed to measure the attractiveness of facial images. Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. The dataset will be available soon.
Poster
Huilin Xu · Jian Ding · Jiakun Xu · Ruixiang Wang · Jun Chen · Jinjie Mai · Yanwei Fu · Bernard Ghanem · Feng Xu · Mohamed Elhoseiny

[ Exhibit Hall I ]

Abstract
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, but action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments.
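A minimal way to realize the described unidirectional attention is a block mask under which action tokens cannot attend to video tokens; the token ordering and the masking convention (True marks a blocked pair, as in nn.MultiheadAttention-style masks) are assumptions of this sketch.

```python
import torch

def unidirectional_block_mask(n_video, n_action):
    """Boolean attention mask where True marks query-key pairs that are blocked.

    Video tokens (first block) may attend to both video and action tokens, while
    action tokens (second block) attend only among themselves. Action prediction
    therefore never depends on the video branch, which is what allows the video
    branch to be dropped at inference time.
    """
    n = n_video + n_action
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[n_video:, :n_video] = True          # action queries -> video keys: blocked
    return mask

mask = unidirectional_block_mask(n_video=4, n_action=3)
print(mask.int())
```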
Poster
Jongseob Yun · Yong-Hoon Kwon · Min-Gyu Park · Ju-Mi Kang · Min-Ho Lee · Inho Chang · Ju Yoon · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
We address the 3D head reconstruction problem and the facial correspondence search problem in a unified framework, named as $\textbf{WarpHE4D}$. The underlying idea is to establish correspondences between the facial image and the fixed UV texture map by exploiting powerful self-supervised visual representations, $\textit{i.e.}$, DINOv2. In other words, we predict UV coordinates for each pixel that maps the pixel to a point in the UV map. At the same time, we predict the nose-centered depth map leveraged by the facial correspondences. Note that our framework does not require fitting a template model, $\text{e.g.,}$ 3DMM, to the image, which directly regresses 4D vectors for each pixel. The experimental results show that our approach not only improves the accuracy of head geometry but also significantly improves the robustness under pose or viewpoint variations, particularly when the head is rotated more than 90 degrees. We believe our method can be a groundwork for photorealistic head avatar generation, even in uncalibrated camera settings.
Poster
Kai Jia · Tengyu Liu · Mingtao Pei · Yixin Zhu · Siyuan Huang

[ Exhibit Hall I ]

Abstract
Synthesizing complex and diverse human-object interactions (HOI) based on minimal instructions is crucial for advancing character animation and embodied AI. Existing approaches primarily rely on data-intensive learning models, which struggle to replicate the nuanced, compositional structure of daily HOI motions. In this paper, we propose a novel framework that leverages a generalizable representation of HOI primitives defined by relative geometry. Our approach uses an object-centric hierarchical planning process, integrating high-level planning, key pose generation, and intermediate motion synthesis to construct realistic HOI sequences achieving novel tasks. Key poses, defined by reusable contact mode primitives, serve as flexible constraints that guide the synthesis of intricate interaction motions through a symbolic planner. Our system generates intermediate motions by first planning object trajectories with collision avoidance, followed by object-motion-guided human motion generation. To ensure coherence and realism, we apply a post-optimization process that aligns motions with planned constraints, resulting in high-quality interaction sequences. Our framework supports zero-shot transfer, enabling the synthesis of novel HOI motions without specific training examples. Experimental results demonstrate that our approach significantly enhances the adaptability, diversity, and quality of synthesized interactions, marking a meaningful step forward in flexible HOI motion generation.
Poster
Ziyun Wang · Ruijun Zhang · Zi-Yan Liu · Yufu Wang · Kostas Daniilidis

[ Exhibit Hall I ]

Abstract
This paper addresses the challenges of estimating a continuous-time field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field from events caused by human motion. Prior state-of-the-art methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, our model leverages a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. We present the first work that replaces traditional event-volume-based discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. To advance the evaluation of continuous-time human pose estimation, we introduce the Beam-splitter Event Agile Human Motion Dataset—a hardware-synchronized high-speed human dataset tailored for this purpose. EvHuman reduces joint errors by 23.8% compared to previous event-based human methods, while reducing the computational time by 69%.
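The notion of a time-implicit motion field can be illustrated with a small time-conditioned MLP queried at arbitrary real-valued timestamps; the latent and pose dimensions below are hypothetical, and the recurrent event encoder described in the abstract is not reproduced.

```python
import torch
import torch.nn as nn

class TimeImplicitPoseField(nn.Module):
    """Map a (latent motion code, continuous time t) pair to pose parameters.

    A minimal stand-in for a time-implicit human motion field: because t is a
    real-valued input rather than a frame index, poses can be queried in
    parallel at any temporal resolution.
    """
    def __init__(self, latent_dim=128, pose_dim=72, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, z, t):
        # z: (B, latent_dim) motion code, t: (B, Q) query times in [0, 1]
        B, Q = t.shape
        zq = z[:, None, :].expand(B, Q, -1)
        x = torch.cat([zq, t[..., None]], dim=-1)
        return self.net(x)                       # (B, Q, pose_dim)

field = TimeImplicitPoseField()
poses = field(torch.randn(2, 128), torch.rand(2, 100))   # 100 arbitrary time queries
print(poses.shape)  # torch.Size([2, 100, 72])
```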
Poster
Yunyang Xiong · Chong Zhou · Xiaoyu Xiang · Lemeng Wu · Chenchen Zhu · Zechun Liu · Saksham Suri · Balakrishnan Varadarajan · Ramya Akula · Forrest Iandola · Raghuraman Krishnamoorthi · Bilge Soran · Vikas Chandra

[ Exhibit Hall I ]

Abstract
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. Our idea is based on adopting lightweight Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with lightweight ViT performs comparably to …
Poster
Yiyang Su · Yunping Shi · Feng Liu · Xiaoming Liu

[ Exhibit Hall I ]

Abstract
Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this challenge, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-scale features from a pre-trained large model (\emph{e.g.}, CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features—appearance, static body shape, and dynamic gait—and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-scale representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (+$11.0\%$ Rank1).
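A generic gated mixture-of-experts combination, not the exact HAMoBE architecture, is sketched below; the assumption that the dual-input gate concatenates query and gallery descriptors is ours, and the experts are plain linear layers standing in for the appearance, body-shape, and gait branches.

```python
import torch
import torch.nn as nn

class GatedExpertFusion(nn.Module):
    """Weight expert embeddings with a gate conditioned on both query and gallery cues."""

    def __init__(self, experts, d_in):
        super().__init__()
        self.experts = nn.ModuleList(experts)           # e.g., appearance / shape / gait
        self.gate = nn.Linear(2 * d_in, len(experts))   # dual-input gating

    def forward(self, query_feat, gallery_feat):
        # Gate weights depend on the query-gallery pair, so the mixture adapts per match.
        w = torch.softmax(self.gate(torch.cat([query_feat, gallery_feat], dim=-1)), dim=-1)
        outs = torch.stack([e(query_feat) for e in self.experts], dim=1)  # (B, E, D_out)
        return (w[..., None] * outs).sum(dim=1)                            # (B, D_out)

# Toy usage with three hypothetical experts over 64-d inputs.
experts = [nn.Linear(64, 32) for _ in range(3)]
fusion = GatedExpertFusion(experts, d_in=64)
print(fusion(torch.randn(4, 64), torch.randn(4, 64)).shape)  # torch.Size([4, 32])
```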
Poster
Jingyu Liu · Zijie Xin · Yuhan Fu · Ruixiang Zhao · Bangxiang Lan · Xirong Li

[ Exhibit Hall I ]

Abstract
Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current sketch animation methods perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we summarize two challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch based on iterative optimization through Score Distillation Sampling (SDS), without any other data for training. We propose four modules: LLM-based scene decomposition, LLM-based motion planning, motion refinement network and compositional SDS, to tackle the two challenges in a divide-and-conquer strategy. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications. The code will be released.
Poster
Yuanlin Wang · Ruiqin Xiong · Rui Zhao · Jin Wang · Xiaopeng Fan · Tiejun Huang

[ Exhibit Hall I ]

Abstract
While image signals are typically defined on a regular 2D grid, there are scenarios where they are only available at irregular positions. In such cases, reconstructing a complete image on regular grid is essential. This paper introduces ISP2HRNet, an end-to-end network designed to reconstruct high resolution image from irregularly sampled pixels that do not fall on a regular grid. To handle the challenges brought by irregular sampling, we propose an architecture to extract gradient structure hierarchically and learn continuous image representation. Specifically, we derive image gradient for each irregularly sampled pixel and further learn higher order gradient structural features according to the geometric and photometric information at the vertices of neighboring triangles. To convert the features from irregular pixels to regular grid, we propose a dual branch content-dependent weight generator to adaptively fuse the information from neighboring irregular pixels. Subsequently, an encoder captures deep structural details on regular grid and forms latent codes. Implicit neural representation parameterized by multi-layer perceptron decodes the latent codes and coordinates to pixel values for generating high resolution image. Experimental results demonstrate that the proposed network can effectively solve the problem of high resolution image reconstruction from irregularly sampled pixels and achieve promising results. The …
Poster
Michael Bernasconi · Abdelaziz Djelouah · Yang Zhang · Markus Gross · Christopher Schroers

[ Exhibit Hall I ]

Abstract
Video super-resolution (VSR) methods typically exploit information across multiple frames to achieve high quality upscaling, with recent approaches demonstrating impressive performance. Nevertheless, challenges remain, particularly in effectively leveraging information over long distances. To address this limitation in VSR, we propose a strategy for long distance information propagation with a flexible fusion module that can optionally also assimilate information from additional high resolution reference images. We design our overall approach such that it can leverage existing pre-trained VSR backbones and adapt the feature upscaling module to support arbitrary scaling factors. Our experiments demonstrate that we can achieve state-of-the-art results on perceptual metrics and deliver more visually pleasing results compared to existing solutions.
Poster
Jungwoo Huh · Yeseung Park · Seongjean Kim · Jungsu Kim · Sanghoon Lee

[ Exhibit Hall I ]

Abstract
Human motion estimation models typically assume a fixed number of input frames, making them sensitive to variations in frame rate and leading to inconsistent motion predictions across different temporal resolutions. This limitation arises because input frame rates inherently determine the temporal granularity of motion capture, causing discrepancies when models trained on a specific frame rate encounter different sampling frequencies. To address this challenge, we propose MBTI (Masked Blending Transformers with Implicit Positional Encoding), a frame rate-agnostic human motion estimation framework designed to maintain temporal consistency across varying input frame rates. Our approach leverages a masked autoencoder (MAE) architecture with masked token blending, which aligns input tokens with a predefined high-reference frame rate, ensuring a standardized temporal representation. Additionally, we introduce implicit positional encoding, which encodes absolute time information using neural implicit functions, enabling more natural motion reconstruction beyond discrete sequence indexing. By reconstructing motion at a high reference frame rate and optional downsampling, MBTI ensures both frame rate generalization and temporal consistency. To comprehensively evaluate MBTI, we introduce EMDB-FPS, an augmented benchmark designed to assess motion estimation robustness across multiple frame rates in both local and global motion estimation tasks. To further assess MBTI’s robustness, we introduce the Motion Consistency …
Poster
Anja Delić · Matej Grcic · Siniša Šegvić

[ Exhibit Hall I ]

Abstract
Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected in unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
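To make the scoring rule concrete, the following is a minimal sketch of a confidence-weighted sum of per-keypoint Gaussian log-conditionals. The isotropic-Gaussian parameterization and the function name `anomaly_score` are illustrative assumptions; in the paper the predictive means and variances would come from the autoregressive model.

```python
# Minimal sketch (not the authors' code): anomaly score from per-keypoint
# Gaussian log-densities, weighted by keypoint detector confidence.
import numpy as np

def anomaly_score(obs_xy, pred_mean, pred_var, det_conf):
    """obs_xy, pred_mean: (K, 2); pred_var, det_conf: (K,)."""
    # Log-density of an isotropic 2D Gaussian at the observed keypoint location.
    sq_err = np.sum((obs_xy - pred_mean) ** 2, axis=-1)
    log_cond = -0.5 * (sq_err / pred_var + 2.0 * np.log(2.0 * np.pi * pred_var))
    # Surprising keypoints (low weighted log-likelihood) yield a high anomaly score.
    return -np.sum(det_conf * log_cond)

rng = np.random.default_rng(0)
obs = rng.normal(size=(17, 2))
score = anomaly_score(obs, np.zeros((17, 2)), np.ones(17), np.full(17, 0.9))
```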
Poster
Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Korrawe Karunratanakul · Pu Wang · Hongfei Xue · Chen Chen · chuan guo · Junli Cao · Jian Ren · Sergey Tulyakov

[ Exhibit Hall I ]

Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to overcome the non-differentiable distribution sampling process encountered by the logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77\%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at \url{https://anonymous-ai-agent.github.io/CAM}
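The combination of expectation sampling and inference-time logit optimization can be sketched as follows. Everything here is a toy stand-in: a random codebook and a linear "decoder" replace the real generative masked motion model, so this only illustrates the differentiable-expectation idea, not the authors' pipeline.

```python
# Minimal sketch: soft expectation over motion tokens + gradient descent on logits
# so the decoded motion matches target joint positions (illustrative assumptions).
import torch

V, D, J = 256, 64, 22 * 3                 # codebook size, token dim, flattened joints
codebook = torch.randn(V, D)
decoder = torch.nn.Linear(D, J)           # placeholder for the motion decoder
logits = torch.randn(1, V, requires_grad=True)
target_joints = torch.randn(1, J)         # controlled joint positions

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(50):
    # Differentiable expectation over the codebook instead of hard sampling.
    soft_token = torch.softmax(logits, dim=-1) @ codebook
    loss = torch.nn.functional.mse_loss(decoder(soft_token), target_joints)
    opt.zero_grad(); loss.backward(); opt.step()
```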
Poster
Huiyang Hu · Peijin Wang · Hanbo Bi · Boyuan Tong · Zhaozhi Wang · Wenhui Diao · Hao Chang · Yingchao Feng · Ziqi Zhang · Yaowei Wang · Qixiang Ye · Kun Fu · Xian Sun

[ Exhibit Hall I ]

Abstract
Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84\%, FLOPs by 24\% and improves throughput by 2.7 times. The code will be made publicly available.
Poster
Jiwen Yu · Yiran Qin · Xintao Wang · Pengfei Wan · Di ZHANG · Xihui Liu

[ Exhibit Hall I ]

Abstract
Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled, scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend the framework to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and code will be publicly available.
Poster
Parnian Zameni · Yuhan Shen · Ehsan Elhamifar

[ Exhibit Hall I ]

Abstract
We introduce MOSCATO: a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. While prior work in object state prediction has typically focused on a single object undergoing one or a few state changes, real-world tasks require tracking many objects whose states evolve over multiple actions. Given the high cost of gathering framewise object-state labels for many videos, we develop a weakly-supervised multiple object state prediction framework, which only uses action labels during training. Specifically, we propose a novel Pseudo-Label Acquisition (PLA) pipeline that integrates large language models, vision–language models, and action segment annotations to generate fine-grained, per-frame object-state pseudo-labels for training a Multiple Object State Prediction (MOSP) network. We further devise a State–Action Interaction (SAI) module that explicitly models the correlations between actions and object states, thereby improving MOSP. To facilitate comprehensive evaluation, we create the MOSCATO benchmark by augmenting three egocentric video datasets with framewise object-state annotations. Experiments show that our multi-stage pseudo-labeling approach and SAI module significantly boost performance over zero-shot VLM baselines and naive extensions of existing methods, underscoring the importance of holistic action–state modeling for fine-grained procedural video understanding.
Poster
Fei Yin · Mallikarjun Reddy · Chun-Han Yao · Rafal Mantiuk · Varun Jampani

[ Exhibit Hall I ]

Abstract
We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with geometry accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages geometry, image, and video priors to create full-view, animatable avatars. Our approach first obtains initial coarse geometry through 3D-GAN inversion. Then, it enhances multiview textures using depth-guided warping signals for cross-view consistency with the help of an image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce Consistent-Inconsistent training to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.
Poster
Yidi Liu · Dong Li · Yuxin Ma · Jie Huang · Wenlong Zhang · Xueyang Fu · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Ultra-high-definition (UHD) image restoration often faces computational bottlenecks and information loss due to its extremely high resolution. Existing studies based on Variational Autoencoders (VAE) improve efficiency by transferring the image restoration process from pixel space to latent space. However, because degraded components are inherently coupled with background elements in degraded images, both the information loss during compression and the information gain during compensation remain uncontrollable. As a result, restored images often exhibit detail loss and incomplete degradation removal. To address this issue, we propose a Controlled Differential Disentangled VAE, which utilizes Hierarchical Contrastive Disentanglement Learning and an Orthogonal Gated Projection Module to guide the VAE to actively discard easily recoverable background information while encoding more difficult-to-recover degraded information into the latent space. Additionally, we design a Complex Invertible Multiscale Fusion Network to handle background features, ensuring their consistency, and utilize a latent space restoration network to transform the degraded latent features, leading to more accurate restoration results. Extensive experimental results demonstrate that our method effectively alleviates the information loss problem in VAE models while ensuring computational efficiency, significantly improving the quality of UHD image restoration, and achieves state-of-the-art results in six UHD restoration tasks with only 1M parameters.
Poster
Kaining Ying · Hengrui Hu · Henghui Ding

[ Exhibit Hall I ]

Abstract
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce **MOVE**, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related areas across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion-Appearance Network (**DMA**). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
Poster
Jianqi Chen · Biao Zhang · Xiangjun Tang · Peter Wonka

[ Exhibit Hall I ]

Abstract
We present **V2M4**, a novel 4D reconstruction method that directly generates a usable 4D mesh animation asset from a single monocular video. Unlike existing approaches that rely on priors from multi-view image and video generation models, our method is based on native 3D mesh generation models. Naively applying 3D mesh generation models to generate a mesh for each frame in a 4D task can lead to issues such as incorrect mesh poses, misalignment of mesh appearance, and inconsistencies in mesh geometry and texture maps. To address these problems, we propose a structured workflow that includes camera search and mesh reposing, condition embedding optimization for mesh appearance refinement, pairwise mesh registration for topology consistency, and global texture map optimization for texture consistency. Our method outputs high-quality 4D animated assets that are compatible with mainstream graphics and game software. Experimental results across a variety of animation types and motion amplitudes demonstrate the generalization and effectiveness of our method. Please refer to our Supplementary Files for video displays.
Poster
Donggeun Lim · Jinseok Bae · Inwoo Hwang · Seungmin Lee · Hwanhee Lee · Young Min Kim

[ Exhibit Hall I ]

Abstract
In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We harness the power of a large language model (LLM) to digest the contextual complexity within the textual input and convert the task into tangible subproblems, such that we can generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability.
Poster
Yufei Zhu · Yiming Zhong · Zemin Yang · Peishan Cong · Jingyi Yu · Xinge Zhu · Yuexin Ma

[ Exhibit Hall I ]

Abstract
Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference fine-tuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate the state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real-world scenarios.
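For readers unfamiliar with preference alignment, a generic pairwise preference loss is sketched below. This is not the paper's HPO objective (whose exact form is not given here); it only illustrates, under assumed names, how positive and negative grasp feedback can be turned into a ranking signal.

```python
# Minimal sketch of a Bradley-Terry style pairwise preference loss
# (generic; the scorer `score_model` is a hypothetical placeholder).
import torch
import torch.nn.functional as F

def preference_loss(score_pos, score_neg, beta=1.0):
    # Maximize the probability that grasps with positive feedback outrank negatives.
    return -F.logsigmoid(beta * (score_pos - score_neg)).mean()

score_model = torch.nn.Linear(128, 1)          # hypothetical grasp-pose scorer
pos, neg = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(score_model(pos), score_model(neg))
loss.backward()
```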
Poster
wanchang Yu · Qing Zhang · Rongjia Zheng · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows us to generate a shadow-independent structure map that includes facial details while excluding the unwanted shadow boundaries. The structure map is then used as a condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective in avoiding previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code will be made publicly available.
Poster
Kangle Deng · Hsueh-Ti Derek Liu · Yiheng Zhu · Xiaoxia Sun · Chong Shang · Kiran Bhat · Deva Ramanan · Jun-Yan Zhu · Maneesh Agrawala · Tinghui Zhou

[ Exhibit Hall I ]

Abstract
Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
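The adaptive-tokenization idea can be illustrated with a toy octree that subdivides a cell only when a per-cell error criterion is too high, so complex regions end up with more cells (and hence more latent tokens). The error function here is a simple variance-based stand-in; the paper uses a quadric-error criterion and a query-based transformer, neither of which is reproduced here.

```python
# Minimal sketch (illustrative, not the authors' code): error-driven octree
# subdivision producing a variable number of leaf cells per shape.
import numpy as np

def cell_error(points):
    # Toy criterion: spread of points inside the cell (stand-in for quadric error).
    return 0.0 if len(points) == 0 else float(np.var(points, axis=0).sum())

def subdivide(points, center, half, depth, max_depth=6, tau=1e-3):
    if depth >= max_depth or cell_error(points) <= tau:
        return [(center, half)]                       # leaf cell -> one latent token
    cells = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                c = center + half * np.array([dx, dy, dz])
                mask = np.all(np.abs(points - c) <= half / 2, axis=1)
                cells += subdivide(points[mask], c, half / 2, depth + 1, max_depth, tau)
    return cells

pts = np.random.rand(2000, 3) - 0.5
leaves = subdivide(pts, np.zeros(3), 0.5, depth=0)    # variable-size tokenization
```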
Poster
Xinyu Liu · Guolei Sun · Cheng Wang · Yixuan Yuan · Ender Konukoglu

[ Exhibit Hall I ]

Abstract
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing state-of-the-art models in reconstruction performance and efficiency.
Poster
Yu-Cheng Lin · Yu-Syuan Xu · Hao-Wei Chen · Hsien-Kai Kuo · Chun-Yi Lee

[ Exhibit Hall I ]

Abstract
Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
Poster
Shuyu Yang · Yaxiong Wang · Li Zhu · Zhedong Zheng

[ Exhibit Hall I ]

Abstract
Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, which locates pedestrians engaged in either routine or anomalous activities via text. To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, playing soccer, and the corresponding anomalies, e.g., lying, being hit, and falling of the same identity. The training set of PAB comprises 1,013,605 synthesized image-text pairs of both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates fine-grained behavior retrieval, and the proposed pose-aware method achieves 84.93% Recall@1 accuracy, surpassing other competitive methods.
Poster
Wenkun He · Yun Liu · Ruitao Liu · Li Yi

[ Exhibit Hall I ]

Abstract
Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. The high correlations and mutual influences among bodies lead to two major challenges, for which we propose solutions in our method, SyncDiff. First, to satisfy the high demands for synchronization of different body motions, we mathematically derive a new set of alignment scores during the training process, and use maximum likelihood sampling on a dynamic graphical model for explicit synchronization during inference. Second, the high-frequency interactions between objects are often overshadowed by the large-scale low-frequency movements. To address this, we introduce frequency decomposition and explicitly represent high-frequency components in the frequency domain. Extensive experiments across five datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
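The frequency-decomposition step can be illustrated with a simple FFT split of a joint trajectory into a low-frequency base and a high-frequency residual. The cutoff value and array shapes are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: separating low- and high-frequency motion components so that
# subtle high-frequency interaction motion is represented explicitly.
import numpy as np

def frequency_decompose(motion, cutoff=5):
    """motion: (T, D) joint trajectory; keep the lowest `cutoff` frequencies as 'low'."""
    spec = np.fft.rfft(motion, axis=0)
    low_spec = np.zeros_like(spec)
    low_spec[:cutoff] = spec[:cutoff]
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    return low, motion - low        # low-frequency base + high-frequency residual

motion = np.cumsum(np.random.randn(120, 3 * 22), axis=0)   # toy 120-frame sequence
low, high = frequency_decompose(motion)
```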
Poster
Zihan Cao · Yu Zhong · Ziqi Wang · Liang-Jian Deng

[ Exhibit Hall I ]

Abstract
Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines.
Poster
Jiaqi Xu · Wenbo Li · Haoze Sun · Fan Li · Zhixin Wang · Long Peng · Jingjing Ren · HAORAN YANG · Xiaowei Hu · Renjing Pei · Pheng-Ann Heng

[ Exhibit Hall I ]

Abstract
Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. This strategy enhances the model's robustness, enabling accurate restoration even when mild perturbations occur in the flow trajectory. Extensive experiments demonstrate that FlowSR achieves outstanding performance in …
Poster
Jiefeng Li · Jinkun Cao · Haotian Zhang · Davis Rempe · Jan Kautz · Umar Iqbal · Ye Yuan

[ Exhibit Hall I ]

Abstract
Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework …
Poster
Xihua Wang · Xin Cheng · Yuyue Wang · Ruihua Song · Yunfeng Wang

[ Exhibit Hall I ]

Abstract
Video-to-audio (V2A) generation aims to synthesize temporally aligned, realistic sounds for silent videos, a critical capability for immersive multimedia applications. Current V2A methods, predominantly based on diffusion or flow models, rely on suboptimal noise-to-audio paradigms that entangle cross-modal mappings with stochastic priors, resulting in inefficient training and convoluted transport paths. We propose VAFlow, a novel flow-based framework that directly models the video-to-audio transformation, eliminating reliance on noise priors. To address modality discrepancies, we employ an alignment variational autoencoder (VAE) that compresses heterogeneous video features into audio-aligned latent spaces while preserving spatiotemporal semantics. By retaining cross-attention mechanisms between video features and flow blocks, our architecture enables classifier-free guidance within video source-driven generation. Without external data or complex training tricks, VAFlow achieves state-of-the-art performance on VGGSound benchmark, surpassing even text-augmented models in audio fidelity, diversity, and distribution alignment. This work establishes a new paradigm for V2A generation with a direct and effective video-to-audio transformation via flow matching.
Poster
Yixin Yang · jiawei zhang · Yang Zhang · Yunxuan Wei · Dongqing Zou · Jimmy Ren · Boxin Shi

[ Exhibit Hall I ]

Abstract
Events provide High Dynamic Range (HDR) intensity changes, which can guide a Low Dynamic Range (LDR) image for HDR reconstruction. However, events only provide temporal intensity differences, and the problem remains ill-posed in over-/under-exposed areas due to the missing initial reference brightness and color information. With strong generation ability, diffusion models have shown their potential for tackling ill-posed problems. Therefore, we introduce conditional diffusion models to hallucinate the missing information. However, directly adopting events and the LDR image as conditions makes it difficult for diffusion models to sufficiently utilize their information. Thus we introduce a pretrained events-image encoder tailored for HDR reconstruction and a pyramid fusion module to provide HDR conditions, which can be efficiently and effectively utilized by the diffusion model. Moreover, the generation results of diffusion models usually exhibit distortion, particularly for fine-grained details. To better preserve fidelity and suppress distortion, we propose a fine-grained detail recovery approach using a histogram-based structural loss. Experiments on real and synthetic data show the effectiveness of the proposed method in terms of both detail preservation and information hallucination.
Poster
Yifan Liu · Shengjun Zhang · Chensheng Dai · Yang Chen · Hao Liu · Chen Li · Yueqi Duan

[ Exhibit Hall I ]

Abstract
Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict independent Gaussians for each frame without fully capturing the relations among Gaussians from different frames, which makes them hard to animate with novel poses. To address this, we propose the Human Gaussian Graph (HGG) to generate generalizable and animatable Gaussian representations. Specifically, we construct a dual-layer graph to model the relations between predicted Gaussians from multiple frames and the SMPL mesh. We design an intra-node operation to aggregate Gaussian information at different timesteps to benefit from video inputs. Furthermore, we propose an inter-node operation to support message passing between SMPL vertices. In this manner, we leverage the human structure prior to recover generalizable and animatable Gaussian representations. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.
Poster
Hongdi Yang · Chengyang Li · Zhenxuan Wu · Gaozheng Li · Jingya Wang · Jingyi Yu · Zhuo Su · Lan Xu

[ Exhibit Hall I ]

Abstract
Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character animation with a powerful diffusion-based generative model. Specifically, we first map coarse user control to intricate character trajectories. Then, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. For further physical realism, we integrate a contact guidance module during inference to refine precise ball-foot interactions. Additionally, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.
Poster
Yilin Wei · Mu Lin · Yuhao Lin · Jian-Jian Jiang · Xiao-Ming Wu · Ling-An Zeng · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Language-guided dexterous grasp generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intent and execute grasping for unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose an Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging this gap with a new generalizable-instructive affordance representation. This affordance can generalize to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon the affordance, our framework introduces Affordance Flow Matching (AFM) for affordance generation with language as input, and Grasp Flow Matching (GFM) for generating dexterous grasps with affordance as input. To evaluate our framework, we build an open-set table-top language-guided dexterous grasp dataset. Extensive experiments in simulation and the real world show that our framework surpasses all previous methods in both seen-category and unseen-category generalization.
Poster
Jiahua Dong · Hui Yin · Wenqi Liang · Hanbin Zhao · Henghui Ding · Nicu Sebe · Salman Khan · Fahad Khan

[ Exhibit Hall I ]

Abstract
Video instance segmentation (VIS) has gained significant attention for its capability in segmenting and tracking object instances across video frames. However, most of the existing VIS methods unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new classes. To address the above challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model, which alleviates catastrophic forgetting of old classes from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt feature, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Experiments verify the effectiveness of our HVPL model compared to other methods.
Poster
Jinghan You · Shanglin Li · Yuanrui Sun · Jiangchuanwei Wei · Mingyu Guo · Chao Feng · Jiao Ran

[ Exhibit Hall I ]

Abstract
Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR), where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT's potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. Specifically, PCO sequentially applies negative class sub-sampling (NCS) for robust and fast feature alignment from random initialization, feature expectation penalties for centroid stabilization, and cluster boundary refinement through full-batch training without NCS constraints. LVFace establishes a new state-of-the-art face recognition baseline, surpassing leading approaches such as UniFace and TopoFR across multiple benchmarks. Extensive experiments demonstrate that LVFace delivers consistent performance gains, while exhibiting scalability to large-scale datasets and compatibility with mainstream VLMs and LLMs. Notably, LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (March 2025), proving its efficacy in real-world scenarios.
Poster
Yongchuan Cui · Peng Liu · HUI ZHANG

[ Exhibit Hall I ]

Abstract
Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging conditions, these models suffer from substantial performance degradation when applied to unseen satellite data, lacking generalizability and thus limiting their applicability. We argue that the performance drops stem primarily from distributional discrepancies between different sources, and that the key to addressing this challenge lies in bridging the gap between training and testing distributions. To validate this idea and further achieve a “train once, deploy forever” capability, this paper introduces a novel and intuitive approach to empower any pansharpening model with generalizability by employing a unified distribution strategy (UniPAN). Specifically, we construct a distribution transformation function that normalizes the pixels sampled from different sources to conform to an identical distribution. The deep models are trained on the transformed domain, and during testing on new datasets, the new data are also transformed to match the training distribution. UniPAN aims to train and test the model on a unified and consistent distribution, thereby enhancing its generalizability. Extensive experiments validate the efficacy of UniPAN, demonstrating its potential to significantly enhance the performance of deep pansharpening models across diverse satellite sensors. Codes will be …
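The "train once on a unified distribution" idea can be illustrated with a simple per-band standardization applied identically at training and test time. The actual UniPAN transformation function may differ; this sketch and the sensor value ranges below are assumptions for illustration only.

```python
# Minimal sketch: map pixels from any sensor to an (approximately) common
# distribution before training/testing (stand-in for the UniPAN transform).
import numpy as np

def to_unified(img):
    """img: (H, W, C) from an arbitrary sensor -> zero-mean, unit-variance per band."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-6
    return (img - mean) / std

train_img = np.random.rand(256, 256, 4) * 2047   # e.g. an 11-bit sensor (assumed)
test_img = np.random.rand(256, 256, 4) * 255     # a different, unseen sensor
x_train, x_test = to_unified(train_img), to_unified(test_img)   # same target stats
```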
Poster
Yanchen Liu · Yanan SUN · Zhening Xing · Junyao Gao · Kai Chen · Wenjie Pei

[ Exhibit Hall I ]

Abstract
Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. Code will be released.
Poster
Yuhwan Jeong · Yunseo Yang · Youngho Yoon · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address these issues, we propose the Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping mask that clusters regions based on spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations. The code will be available soon.
Poster
Gavriel Habib · Noa Barzilay · Or Shimshi · Rami Ben-Ari · Nir Darshan

[ Exhibit Hall I ]

Abstract
Gait recognition is a computer vision task that identifies individuals based on their walking patterns. Gait recognition performance is commonly evaluated by ranking a gallery of candidates and measuring the accuracy at the top Rank-$K$. Existing models are typically single-staged, i.e. searching for the probe's nearest neighbors in a gallery using a single global feature representation. Although these models typically excel at retrieving the correct identity within the top-$K$ predictions, they struggle when hard negatives appear in the top short-list, leading to relatively low performance at the highest ranks (e.g., Rank-1). In this paper, we introduce CarGait, a Cross-Attention Re-ranking method for gait recognition, which re-orders the top-$K$ list by leveraging the fine-grained correlations between pairs of gait sequences through cross-attention between gait strips. This re-ranking scheme can be adapted to existing single-stage models to enhance their final results. We demonstrate the capabilities of CarGait through extensive experiments on three common gait datasets, Gait3D, GREW, and OU-MVLP, and seven different gait models, showing consistent improvements in Rank-1 and Rank-5 accuracy, superior results over existing re-ranking methods, and strong baselines.
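The second-stage re-ranking step can be sketched generically: the top-K shortlist from the single-stage model is re-ordered by combining the first-stage distance with a pairwise score. Here the pairwise scorer is a random placeholder standing in for the cross-attention model; the mixing weight `alpha` is an assumption.

```python
# Minimal sketch (illustrative): re-ranking a top-K shortlist with pairwise scores.
import numpy as np

def rerank(first_stage_dist, pair_score, alpha=0.5):
    """first_stage_dist, pair_score: (K,) arrays over the top-K candidates.
    Lower distance and higher pair score are both better."""
    combined = alpha * first_stage_dist - (1 - alpha) * pair_score
    return np.argsort(combined)              # new ordering of the shortlist

K = 10
dist = np.sort(np.random.rand(K))            # distances from the single-stage model
scores = np.random.rand(K)                   # fine-grained (probe, candidate) scores
new_order = rerank(dist, scores)
```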
Poster
Wontae Kim · Keuntek Lee · Nam Ik Cho

[ Exhibit Hall I ]

Abstract
A 3D lookup table (3D LUT) is a classic yet effective tool for image enhancement and restoration tasks, even in the deep learning era. The 3D LUT efficiently reduces model size and runtime by instantly transforming an input color value into another color value through interpolation of pre-calculated values at the vertices. However, a limitation of 3D LUT transforms is their lack of spatial information, as they convert color values on a point-by-point basis. To address this weakness, researchers have explored spatial-aware 3D LUT methods, which provide spatial features through additional modules. While spatial-aware 3D LUT methods show promising performance, the extra modules introduce a substantial number of parameters and an increased runtime, particularly as the resolution of the input image rises. To tackle this issue, we propose a method for generating image-adaptive 3D LUTs by considering the redundant parts of the tables. We introduce an efficient framework that decomposes the 3D LUT into a linear sum of low-dimensional LUTs and utilizes singular value decomposition (SVD). Additionally, we modify the modules for spatial features to be more cache-efficient and image-adaptive, thereby reducing runtime and improving performance. Our model effectively reduces the number of parameters and runtime, while maintaining competitive performance, …
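To convey why low-rank decomposition of a LUT saves parameters, here is a minimal SVD-based compression of a smooth, gamma-like table. The flattening scheme, LUT size, and rank are assumptions for illustration; the paper's exact factorization may differ.

```python
# Minimal sketch: compress a D x D x D x 3 LUT as a truncated SVD, i.e. a linear
# sum of a few low-dimensional components instead of the full table.
import numpy as np

D, rank = 33, 8
g = np.linspace(0, 1, D)
r, gg, b = np.meshgrid(g, g, g, indexing="ij")
lut = np.stack([r ** 0.8, gg ** 1.0, b ** 1.2], axis=-1)   # a smooth, gamma-like LUT
mat = lut.reshape(D * D, D * 3)                            # flatten to a 2D matrix

U, S, Vt = np.linalg.svd(mat, full_matrices=False)
lut_lowrank = (U[:, :rank] * S[:rank]) @ Vt[:rank]         # keep `rank` components
print("params:", mat.size, "->", U[:, :rank].size + rank + Vt[:rank].size)
print("relative error:", np.linalg.norm(mat - lut_lowrank) / np.linalg.norm(mat))
```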
Poster
Jiwoo Park · Tae Choi · Youngjun Jun · Seong Jae Hwang

[ Exhibit Hall I ]

Abstract
Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.
Poster
Haoyu Yao · Bin Yang · Wenke Huang · Mang Ye · Bo Du

[ Exhibit Hall I ]

Abstract
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to train a cross-modality retrieval model without labels, reducing the reliance on expensive cross-modality manual annotation. However, existing USL-VI-ReID methods rely on artificially paired cross-modality data as implicit supervision, which is also expensive for human annotation and contrary to the setting of unsupervised tasks. In addition, this full alignment of identity across modalities is inconsistent with real-world scenarios, where unpaired settings are prevalent. To this end, we study the USL-VI-ReID task under unpaired settings, which uses cross-modality unpaired and unlabeled data to train a VI-ReID model. We propose a novel Mapping and Collaborative Learning (MCL) framework. Specifically, we first design a simple yet effective Cross-modality Feature Mapping (CFM) module to map and generate fake cross-modality positive feature pairs, constructing a cross-modal pseudo-identity space for feature alignment. Then, a Static-Dynamic Collaborative (SDC) learning strategy is proposed to align cross-modality correspondences through a collaborative approach, eliminating inter-modality discrepancies at different levels, i.e., cluster level and instance level, in scenarios with cross-modal identity mismatches. Extensive experiments on the SYSU-MM01 and RegDB benchmarks under paired and unpaired settings demonstrate that our proposed MCL significantly outperforms existing unsupervised methods, facilitating the real-world deployment of USL-VI-ReID.
Poster
Jiang Yuan · ji ma · Bo Wang · Guanzhou Ke · Weiming Hu

[ Exhibit Hall I ]

Abstract
Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve performance, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks.
Poster
Zhenzhi Wang · Yixuan Li · yanhong zeng · Yuwei Guo · Dahua Lin · Tianfan Xue · Bo Dai

[ Exhibit Hall I ]

Abstract
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand existing human video datasets with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.
Poster
Wentao Zhu · Zhining Zhang · Yuwei Ren · Yin Huang · Hao Xu · Yizhou Wang

[ Exhibit Hall I ]

Abstract
Mirror neurons are a class of neurons that activate both when an individual observes an action and when they perform the same action. This mechanism reveals a fundamental interplay between action understanding and embodied execution, suggesting that these two abilities are inherently connected. Nonetheless, existing machine learning methods largely overlook this interplay, treating these abilities as separate tasks. In this study, we provide a unified perspective in modeling them through the lens of representation learning. We first observe that their intermediate representations spontaneously align. Inspired by mirror neurons, we further introduce an approach that explicitly aligns the representations of observed and executed actions. Specifically, we employ two linear layers to map the representations to a shared latent space, where contrastive learning enforces the alignment of corresponding representations, effectively maximizing their mutual information. Experiments demonstrate that this simple approach fosters mutual synergy between the two tasks, effectively improving representation quality and generalization.
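The described alignment (two linear layers into a shared space plus contrastive learning) can be sketched directly. The class name, feature dimensions, and temperature below are illustrative assumptions; only the overall recipe (linear heads plus a symmetric InfoNCE objective over matching pairs) follows the abstract.

```python
# Minimal sketch: align "observed action" and "executed action" representations
# with two linear heads and a symmetric InfoNCE loss (maximizing mutual information
# between corresponding pairs).
import torch
import torch.nn.functional as F

class AlignHeads(torch.nn.Module):
    def __init__(self, d_obs=768, d_act=512, d_shared=256):
        super().__init__()
        self.proj_obs = torch.nn.Linear(d_obs, d_shared)
        self.proj_act = torch.nn.Linear(d_act, d_shared)

    def forward(self, z_obs, z_act, tau=0.07):
        a = F.normalize(self.proj_obs(z_obs), dim=-1)
        b = F.normalize(self.proj_act(z_act), dim=-1)
        logits = a @ b.t() / tau                      # (B, B) similarity matrix
        labels = torch.arange(a.size(0))
        # Matching (observation_i, action_i) pairs are positives; the rest negatives.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

heads = AlignHeads()
loss = heads(torch.randn(16, 768), torch.randn(16, 512))
loss.backward()
```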
Poster
Yixiang Chen · Peiyan Li · Yan Huang · Jiabing Yang · Kehan Chen · Liang Wang

[ Exhibit Hall I ]

Abstract
Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating state-of-the-art performance, with improvements over prior object-centric flow methods of 62% in occluded object handling, 45% in deformable object manipulation, and 80% in non-object-displacement tasks.
Poster
Sagnik Majumder · Tushar Nagarajan · Ziad Al-Halah · Kristen Grauman

[ Exhibit Hall I ]

Abstract
We introduce Switch-a-view, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled---but human-edited---video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages.
Poster
Dongming Wu · Yanping Fu · Saike Huang · Yingfei Liu · Fan Jia · Nian Liu · Feng Dai · Tiancai Wang · Rao Anwer · Fahad Khan · Jianbing Shen

[ Exhibit Hall I ]

Abstract
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our code and benchmark will be released.
Poster
Maksim Siniukov · Di Chang · Minh Tran · Hongkun Gong · Ashutosh Chaubey · Mohammad Soleymani

[ Exhibit Hall I ]

Abstract
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8\% in FID on RealTalk) and motion representation (+6.1\% in FD metric on VICO) spaces. User studies confirm the superior …
Poster
Shuang Xu · Zixiang Zhao · Haowen Bai · Chang Yu · Jiangjun Peng · Xiangyong Cao · Deyu Meng

[ Exhibit Hall I ]

Abstract
Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed unsupervised Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.
Poster
Runqi Wang · Caoyuan Ma · Guopeng Li · Hanrui Xu · Yuke Li · Zheng Wang

[ Exhibit Hall I ]

Abstract
Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include action labels (e.g., "walk, bend"), which limits flexibility and practicability in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, newly introduced Scene Texts may yield multiple reasonable output results, causing significant challenges for existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HumanML3D++, by extending texts of the largest existing dataset, HumanML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we also benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction. Our data, model, and code will be released.
Poster
Quang Nguyen · Nhat Le · Baoru Huang · Minh VU · Chengcheng Tang · Van Nguyen · Ngan Le · Thieu Vo · Anh Nguyen

[ Exhibit Hall I ]

Abstract
Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba in modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically grounded. Extensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
Poster
Gwanghyun Kim · Suh Jeon Jeon · Seunggyu Lee · Se Young Chun

[ Exhibit Hall I ]

Abstract
Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis.
Poster
Yang JingYi · Xun Lin · Zitong YU · Liepiao Zhang · Xin Liu · Hui Li · Xiaochen Yuan · Xiaochun Cao

[ Exhibit Hall I ]

Abstract
With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi-modal learning, multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) **Intra-domain modality misalignment**, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non-defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) **Inter-domain modality misalignment**, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose a fusion module based on mutual information maximization, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed **D**ual **A**lignment of **D**omain and **M**odality (DADM), achieves state-of-the-art performance in extensive experiments across four challenging protocols demonstrating its robustness in multi-modal domain generalization scenarios. …
Poster
hanwen Zhang · Congqi Cao · Qinyi Lv · Lingtong Min · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is catching growing interest, as it can model the normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these ``unseen'' anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. Through autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function for an enhanced appearance perception and accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly …
Poster
Shaowei Liu · chuan guo · Bing Zhou · Jian Wang

[ Exhibit Hall I ]

Abstract
Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
Poster
Yingyan Xu · Kate Gadola · Prashanth Chandran · Sebastian Weiss · Markus Gross · Gaspard Zoss · Derek Bradley

[ Exhibit Hall I ]

Abstract
We present a new method for reconstructing the appearance properties of human faces from a lightweight capture procedure in an unconstrained environment. Our method recovers the surface geometry, diffuse albedo, specular intensity and specular roughness from a monocular video containing a simple head rotation in-the-wild. Notably, we make no simplifying assumptions on the environment lighting, and we explicitly take visibility and occlusions into account. As a result, our method can produce facial appearance maps that approach the fidelity of studio-based multi-view captures, but with a far easier and cheaper procedure.
Poster
Tobias Kirschstein · Javier Romero · Artem Sevastopolsky · Matthias Nießner · Shunsuke Saito

[ Exhibit Hall I ]

Abstract
Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars …
Poster
Uzay Gökay · Federico Spurio · Dominik Bach · Juergen Gall

[ Exhibit Hall I ]

Abstract
Current state-of-the-art methods for skeleton-based temporal action segmentation are fully supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. The latent representation is then segmented into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate our model on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. Our results demonstrate that SMQ outperforms the current state-of-the-art unsupervised temporal action segmentation methods.
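The "segment into patches, then quantize into motion words" step described above can be pictured with the sketch below, which snaps non-overlapping patches of a latent sequence to their nearest codebook entries using a straight-through estimator. The patch length, codebook size, and squared-Euclidean distance are assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MotionWordQuantizer(nn.Module):
    """Splits a latent sequence into non-overlapping patches and snaps each
    patch to its nearest codebook entry (a discrete "motion word")."""

    def __init__(self, latent_dim: int, patch_len: int = 8, num_words: int = 128):
        super().__init__()
        self.patch_len = patch_len
        self.codebook = nn.Embedding(num_words, latent_dim * patch_len)

    def forward(self, z: torch.Tensor):
        # z: (B, T, D) latent sequence from the temporal autoencoder.
        B, T, D = z.shape
        T = (T // self.patch_len) * self.patch_len                 # drop an incomplete tail patch
        patches = z[:, :T].reshape(B, -1, self.patch_len * D)      # (B, P, patch_len*D)
        # Nearest codebook entry per patch (Euclidean distance).
        dists = torch.cdist(patches, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        words = dists.argmin(dim=-1)                               # (B, P) discrete motion words
        quantized = self.codebook(words)                           # (B, P, patch_len*D)
        # Straight-through estimator keeps gradients flowing to the encoder.
        quantized = patches + (quantized - patches).detach()
        return quantized, words

quantizer = MotionWordQuantizer(latent_dim=64)
q, words = quantizer(torch.randn(2, 40, 64))   # 40 frames -> 5 patches per sequence
```

The resulting discrete word indices are the kind of symbols that can then be clustered into candidate action segments.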
Poster
Quanhao Li · Zhen Xing · Rui Wang · Hui Zhang · Qi Dai · Zuxuan Wu

[ Exhibit Hall I ]

Abstract
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics.
Poster
Donglin Di · He Feng · Wenzhang SUN · Yongjia Ma · Hao Li · Chen Wei · Lei Fan · Tonghua Su · Xun Yang

[ Exhibit Hall I ]

Abstract
Human-centric generative models are becoming increasingly popular, giving rise to various innovative tools and applications, such as talking face videos conditioned on text or audio prompts. The core of these capabilities lies in powerful pretrained foundation models, trained on large-scale, high-quality datasets. However, many advanced methods rely on in-house data subject to various constraints, and other current studies fail to generate high-resolution face videos, which is mainly attributed to the significant lack of large-scale, high-quality face video datasets. In this paper, we introduce a human face video dataset, \textbf{DH-FaceVid-1K}. Our collection spans 1200 hours in total, encompassing 270,043 video samples from over 20,000 individuals. Each sample includes corresponding speech audio, facial keypoints, and text annotations. Compared to other publicly available datasets, ours distinguishes itself through its multi-ethnic coverage and high-quality comprehensive individual attributes. We establish multiple face video generation models supporting tasks such as text-to-video and image-to-video generation. In addition, we develop comprehensive benchmarks to validate the scaling law when using different proportions of our dataset. Our primary aim is to contribute a face video dataset, particularly addressing the underrepresentation of Asian faces in existing curated datasets and thereby enriching the global spectrum of face-centric data and mitigating demographic biases.
Poster
Qi Zhao · Xingyu Ni · Ziyu Wang · Feng Cheng · Ziyan Yang · Lu Jiang · Bohan Wang

[ Exhibit Hall I ]

Abstract
We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos generated via standard computer graphics techniques. These rendered videos respect real-world physics -- such as maintaining 3D consistency -- thereby serving as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, minimizing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its effectiveness in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis.
Poster
Zepeng Su · zhulin liu · Zongyan Zhang · Tong Zhang · C.L.Philip Chen

[ Exhibit Hall I ]

Abstract
Face aging is a typical ill-posed problem influenced by various factors such as environment and genetics, leading to highly diverse outcomes. However, existing methods primarily rely on numerical age representations, making it difficult to accurately capture individual or group-level aging patterns. To address this, we introduce a novel disentangled face representation, where age features are modeled in the image modality (referred to as the Age Prompt), providing richer prior age information to constrain the generation results. To this end, we design an ID-age multi-task co-learning framework and propose the Bidirectional Adversarial Disentanglement (BAD) strategy. This strategy maximizes the disentanglement of ID and age representation through bidirectional adversarial learning, extracting their attribute-invariant representations. Based on this representation, we propose TimeBooth, a personalized face aging model capable of generating diverse and individualized aging results. To optimize training, we construct a cross-age hybrid data pipeline and introduce various training strategies. Finally, we propose the R-AgeMAE metric and validate our method through extensive experiments, demonstrating that TimeBooth outperforms existing methods in both diversity and controllability.
Poster
Hengyuan Zhang · Zhe Li · Xingqun Qi · Mengze Li · Muyi Sun · Siye Wang · Man Zhang · Sirui Han

[ Exhibit Hall I ]

Abstract
Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. While existing methods enable dance synthesis directly, they overlook the fact that affording editable dance movements for users is more practical in real choreography scenes. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits progress on this challenge. To achieve this goal, we first construct $\textbf{DanceRemix}$, a large-scale multi-turn editable dance dataset comprising over 12.6M dance frames and 42K prompt pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely $\textbf{DanceEditor}$. Considering that the dance motion should be both musically rhythmic and support iterative editing through user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authenticity of generated results by directly modeling dance movements from tailored aligned music. Moreover, at the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to derive the editable results through a specifically designed $\textbf{Cross-modality Edition Module (CEM)}$. Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results remain musically harmonious while preserving fine-grained semantic alignment …
Poster
Bahri Batuhan Bilecen · Ahmet Berke Gokmen · Furkan Güzelant · Aysegul Dundar

[ Exhibit Hall I ]

Abstract
3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across applications such as gaming and virtual reality. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. Leveraging the PanoHead model which provides 360-degree consistent renders, we propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Code will be publicly released.
Poster
Dongjin Kim · Jaekyun Ko · Muhammad Kashif Ali · Tae Hyun Kim

[ Exhibit Hall I ]

Abstract
Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning–based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources but still suffer from overfitting. To address these issues, we conduct image denoising utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improve resilience to unseen noise. Repetition of this process greatly improves denoising performance. Our method leverages a Feature Extraction Module for robust noise-invariant features, and Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model ($\sim$ 0.04 M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
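The iterative dynamic filtering described above can be pictured as predicting one k x k kernel per pixel and repeatedly applying it. The sketch below shows only that filtering step; the hypothetical `kernel_predictor` network and the softmax normalization stand in for the paper's Kernel Prediction Module and are assumptions.

```python
import torch
import torch.nn.functional as F

def apply_pixelwise_kernels(noisy: torch.Tensor, kernels: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Filter each pixel with its own predicted k x k kernel.

    noisy:   (B, C, H, W) image
    kernels: (B, k*k, H, W) per-pixel kernels (e.g. softmax-normalized along dim 1)
    """
    B, C, H, W = noisy.shape
    # Gather the k x k neighborhood of every pixel: (B, C*k*k, H*W).
    patches = F.unfold(noisy, kernel_size=k, padding=k // 2)
    patches = patches.view(B, C, k * k, H, W)
    # Weighted sum of each neighborhood with that pixel's own kernel.
    weights = kernels.view(B, 1, k * k, H, W)
    return (patches * weights).sum(dim=2)                    # (B, C, H, W)

def iterative_denoise(noisy, kernel_predictor, steps: int = 3, k: int = 5):
    """Iterative dynamic filtering: re-predict kernels and filter repeatedly.
    `kernel_predictor` is a hypothetical network returning (B, k*k, H, W) logits."""
    x = noisy
    for _ in range(steps):
        kernels = torch.softmax(kernel_predictor(x), dim=1)
        x = apply_pixelwise_kernels(x, kernels, k)
    return x
```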
Poster
Rui Wang · Yimu Sun · Jingxing Guo · Huisi Wu · Jing Qin

[ Exhibit Hall I ]

Abstract
Accurate segmentation of cardiac chamber structures in echocardiogram sequences is of great significance for clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. Existing methods based on convolutional neural networks, Transformers, and space-time memory have indeed improved segmentation accuracy to some extent, but they are often restricted by limited local receptive fields and insufficient temporal memory retrieval. In this paper, we propose a novel model for echocardiography video segmentation, called GDKVM. The model employs linear key-value associations (LKVA) to effectively model inter-frame correlations, and introduces the gated delta rule (GDR) to ideally store intermediate memory states. The key-pixel feature fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiogram video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. GDKVM provides more accurate and efficient cardiac chamber segmentation outcomes for clinical applications. The code will be released upon publication.
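For readers unfamiliar with delta-rule memories, the sketch below shows a generic gated delta-rule update over a linear key-value memory, the family of recurrences that the LKVA and GDR components above refer to. The specific gating and normalization used in GDKVM may well differ; the shapes and gates here are illustrative only.

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Generic gated delta-rule recurrence over a linear key-value memory.

    q, k, v: (B, T, D) queries, keys, values (keys assumed L2-normalized)
    alpha:   (B, T) per-step gate in [0, 1] that decays the old memory
    beta:    (B, T) per-step write strength in [0, 1]
    Returns per-step read-outs of shape (B, T, D).
    """
    B, T, D = q.shape
    memory = q.new_zeros(B, D, D)          # key -> value associative memory
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[:, t], v[:, t], q[:, t]                  # (B, D)
        pred = torch.bmm(memory, k_t.unsqueeze(-1)).squeeze(-1)    # current association for k_t
        # Delta update: overwrite only the mismatch between the stored and new
        # value, while alpha gates how much of the old memory survives.
        delta = (v_t - pred) * beta[:, t:t + 1]                    # (B, D)
        memory = alpha[:, t, None, None] * memory + torch.bmm(delta.unsqueeze(-1), k_t.unsqueeze(1))
        outputs.append(torch.bmm(memory, q_t.unsqueeze(-1)).squeeze(-1))
    return torch.stack(outputs, dim=1)
```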
Poster
Yingjie Zhou · Jiezhang Cao · Zicheng Zhang · Farong Wen · Jiang Yanwei · Jun Jia · Xiaohong Liu · Xiongkuo Min · Guangtao Zhai

[ Exhibit Hall I ]

Abstract
Speech-driven methods for portraits are figuratively known as ``Talkers'' because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging form of digital human media. However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper \textbf{presents the largest AGTH quality assessment dataset THQA-10K} to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs, which provides rich material for AGTH quality assessment. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, \textbf{an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed}. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work in this paper will be …
Poster
Shengkai Sun · Zefan Zhang · Jianfeng Dong · Zhiyong Cheng · Xiaojun Chang · Meng Wang

[ Exhibit Hall I ]

Abstract
Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2$\times$ faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.
Poster
Tanay Agrawal · Abid Ali · Antitza Dantcheva · Francois Bremond

[ Exhibit Hall I ]

Abstract
Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy first optimizes for high temporal and low spatial resolution, then vice versa, allowing the model to utilize both high spatial and temporal resolutions during inference while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Extensive experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance, surpassing existing methods in accuracy and efficiency. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.
Poster
Mahnoor Saad · Ziad Al-Halah

[ Exhibit Hall I ]

Abstract
How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene's key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods. Code and dataset will be released publicly upon acceptance.
Poster
YUE QIU · Yanjun Sun · Takuma Yagi · Shusaku Egami · Natsuki Miyata · Ken Fukuda · Kensho Hara · Ryusuke Sagawa

[ Exhibit Hall I ]

Abstract
Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetBench, a curated benchmark designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetBench reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.
Poster
Bin Cao · Sipeng Zheng · Ye Wang · Lujie Xia · Qianshan Wei · Qin Jin · Jing Liu · Zongqing Lu

[ Exhibit Hall I ]

Abstract
Human motion generation holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators. We believe the release of HuMo100M and MotionCtrl will significantly advance the motion community toward real-life applications. Code and data will be available at \url{https://anonymous.4open.science/r/MotionCtrl}.
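A part-aware residual quantizer of the kind mentioned above can be sketched as one stack of residual codebooks per body part, where each level encodes what the previous levels missed. The number of parts, levels, and codebook size below are placeholders, not values taken from HuMo100M or MotionCtrl.

```python
import torch
import torch.nn as nn

class PartAwareResidualQuantizer(nn.Module):
    """Quantizes each body part's motion feature with its own stack of
    residual codebooks: every level encodes the residual left by earlier levels."""

    def __init__(self, num_parts: int = 5, dim: int = 64, levels: int = 3, codebook_size: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList([
            nn.ModuleList([nn.Embedding(codebook_size, dim) for _ in range(levels)])
            for _ in range(num_parts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (B, num_parts, dim) per-part motion features for one frame or segment.
        quantized, codes = torch.zeros_like(x), []
        for p, books in enumerate(self.codebooks):
            residual = x[:, p]
            part_codes = []
            for book in books:
                dists = torch.cdist(residual, book.weight)        # (B, codebook_size)
                idx = dists.argmin(dim=-1)                        # nearest code at this level
                quantized[:, p] = quantized[:, p] + book(idx)
                residual = residual - book(idx)                   # pass on what is still unexplained
                part_codes.append(idx)
            codes.append(torch.stack(part_codes, dim=-1))         # (B, levels)
        return quantized, torch.stack(codes, dim=1)               # (B, num_parts, levels)

rq = PartAwareResidualQuantizer()
q, codes = rq(torch.randn(4, 5, 64))
```

Because each part has its own token stream, a controller can in principle edit one body part's codes while leaving the others untouched, which is the kind of fine-grained control the abstract targets.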
Poster
Sunjae Yoon · Gwanhyeong Koo · Younghwan Lee · Ji Woo Hong · Chang Yoo

[ Exhibit Hall I ]

Abstract
3D animation aims to generate a 3D animated video from an input image and a target 3D motion sequence. Recent advances in image-to-3D models enable the creation of animations directly from user-hand drawings. Distinguished from conventional 3D animation, drawing-based 3D animation must preserve the artist's unique style properties, such as rough contours and distinct stroke patterns. However, recent methods still exhibit quality deterioration in these style properties, especially under occlusions caused by overlapping body parts, leading to contour flickering and stroke blurring. This occurs due to a `stylization pose gap' between training and inference in stylization networks designed to preserve drawing styles in drawing-based 3D animation systems. The stylization pose gap denotes that the target poses used to train the stylization network are always occlusion-free, while target poses encountered at inference include diverse occlusions under dynamic motions. To this end, we propose the Occlusion-robust Stylization Framework (OSF) for drawing-based 3D animation. Our investigation found that while an object's edges can be an effective input prior for guiding stylization, they become notably inaccurate when occlusions occur at inference. Thus, our proposed OSF provides occlusion-robust edge guidance for the stylization network using optical flow, which ensures consistent stylization even under …
Poster
Zheyuan Zhang · Weihao Tang · Hong Chen

[ Exhibit Hall I ]

Abstract
Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Moreover, due to its unique design, the model can maintain sensitivity to local information as the feature fusion deepens. …
Poster
Yaowu Fan · Jia Wan · Tao Han · Antoni Chan · Jinhua Ma

[ Exhibit Hall I ]

Abstract
Video Individual Counting (VIC) has received increasing attention recently due to its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous crowd counting datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation under highly varying views and times in crowded scenes. While VIC methods have been proposed based on localization-then-association or localization-then-classification, they may not perform well due to the difficulty of accurately localizing crowded and small targets under challenging scenarios. To address these issues, we collect the MovingDroneCrowd Dataset and propose a density map based VIC method. Different from existing datasets, our dataset consists of videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. Rather than localizing individuals, we propose a Depth-wise Cross-Frame Attention (DCFA) module, which directly estimates inflow and outflow density maps to learn the shared density between consecutive frames. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our dataset and publicly available ones demonstrate the superiority of our method over the state of the art for VIC in highly dynamic and complex crowded …
Poster
Chi-Hsi Kung · Frangil Ramirez · Juhyung Ha · Yi-Hsuan Tsai · Yi-Ting Chen · David Crandall

[ Exhibit Hall I ]

Abstract
Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions, but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen ``What if'' scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, and long-term action recognition. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks. We will make our source code and data publicly available upon acceptance.
Poster
Shihao Zhou · Dayu Li · Jinshan Pan · Juncheng Zhou · Jinglei Shi · Jufeng Yang

[ Exhibit Hall I ]

Abstract
Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculations independently over uniformly split subspaces, which triggers a redundancy issue that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.
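The idea of heads learning from subspaces of varying sizes can be made concrete with a minimal self-attention sketch in which the channel dimension is split unevenly across heads. This is only a toy version of what HMHA describes; the concrete head sizes are invented here, and the Query-Key Cache Updating scheme is not modeled at all.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VaryingHeadAttention(nn.Module):
    """Self-attention whose heads operate on subspaces of different sizes,
    so each head attends over a differently sized slice of the channels."""

    def __init__(self, dim: int, head_dims=(8, 16, 40)):
        super().__init__()
        assert sum(head_dims) == dim
        self.head_dims = head_dims
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features, e.g. flattened image patches.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        outputs, start = [], 0
        for d in self.head_dims:                         # one attention map per unequal head
            q_h, k_h, v_h = (t[..., start:start + d] for t in (q, k, v))
            attn = F.softmax(q_h @ k_h.transpose(-2, -1) / d ** 0.5, dim=-1)
            outputs.append(attn @ v_h)
            start += d
        return self.proj(torch.cat(outputs, dim=-1))

attn = VaryingHeadAttention(dim=64)
y = attn(torch.randn(2, 49, 64))                         # a 7x7 patch grid
```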
Poster
Zhanfeng Liao · Hanzhang Tu · Cheng Peng · Hongwen Zhang · Boyao Zhou · Yebin Liu

[ Exhibit Hall I ]

Abstract
We introduce HADES, the first framework to seamlessly integrate dynamic hair into human avatars. HADES represents hair as strands bound to 3D Gaussians, with roots attached to the scalp. By modeling inertial and velocity-aware motion, HADES is able to simulate realistic hair dynamics that naturally align with body movements. To enhance avatar fidelity, we incorporate multi-scale data and address color inconsistencies across cameras using a lightweight MLP-based correction module, which generates color correction matrices for consistent color tones. In addition, we resolve rendering artifacts, such as hair dilation during zoom-out, through a 2D Mip filter and physically constrained hair radii. Furthermore, a temporal fusion module is introduced to ensure temporal coherence by modeling historical motion states. Experimental results demonstrate that HADES achieves high-fidelity avatars with physically plausible hair dynamics, outperforming existing state-of-the-art solutions in realism and robustness.
Poster
Jeongsol Kim · Bryan Sangwoo Kim · Jong Ye

[ Exhibit Hall I ]

Abstract
Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend diffusion inverse solvers (DIS), which perform posterior sampling by combining a denoising diffusion prior with a likelihood gradient, into the flow framework. Specifically, by deriving the flow version of Tweedie's formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.
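The decomposition of the flow ODE into a clean-image estimate and a noise estimate can be written down explicitly under the rectified-flow convention x_t = (1 - t) x_0 + t eps with velocity v = eps - x_0, which gives x_0 = x_t - t v and eps = x_t + (1 - t) v. The sketch below applies this identity and adds a generic likelihood-gradient correction for a known degradation operator; it is a simplified stand-in under that convention, not the exact FlowDPS update rule.

```python
import torch

def decompose_velocity(x_t: torch.Tensor, v_pred: torch.Tensor, t: float):
    """Split a flow-matching velocity prediction into clean-image and noise
    estimates, assuming the rectified-flow interpolation
        x_t = (1 - t) * x0 + t * eps,   v = eps - x0   (data at t = 0)."""
    x0_hat = x_t - t * v_pred           # flow analogue of Tweedie's clean estimate
    eps_hat = x_t + (1.0 - t) * v_pred  # complementary noise estimate
    return x0_hat, eps_hat

def guided_step(x_t, v_pred, t, dt, measurement, forward_op, step_size=1.0):
    """One illustrative posterior-sampling step: nudge the clean estimate toward
    consistency with the measurement, then move along the corrected velocity.
    `forward_op` is a hypothetical differentiable degradation operator A."""
    x0_hat, eps_hat = decompose_velocity(x_t, v_pred, t)
    x0_hat = x0_hat.detach().requires_grad_(True)
    residual = (measurement - forward_op(x0_hat)).pow(2).sum()
    grad = torch.autograd.grad(residual, x0_hat)[0]
    x0_guided = x0_hat - step_size * grad               # likelihood-gradient correction
    v_guided = eps_hat - x0_guided                      # rebuild a velocity from the two parts
    return (x_t - dt * v_guided).detach()               # integrate toward t - dt
```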
Poster
Alireza Esmaeilzehi · Hossein Zaredar · Yapeng Tian · Laleh Seyyed-Kalantari

[ Exhibit Hall I ]

Abstract
Deep blind image super resolution (Blind SR) schemes strive to provide high performances under various image degradation processes. Despite the significant advancement in the area of Blind SR, the performances of these methods still may not be as high as one would desire in the case of real-world degradation operations. In this paper, we develop a novel diffusion-based Blind SR method, which, by leveraging compositional zero-shot learning, is able to provide superior performances for both synthetic and real-world unknown degradation processes. Specifically, we first extract both synthetic and real-world degradation embeddings from the input visual signal in a compositional zero-shot fashion. Next, we have efficiently embedded such degradation embeddings in the architecture of our diffusion-based scheme for guiding the diffusion feature generation process. The results of extensive experiments have demonstrated the effectiveness of the proposed Blind SR method over the state-of-the-art algorithms. Our source code and pre-trained models will be publicly available.
Poster
Hao Huang · Shuaihang Yuan · Geeta Chandra Raju Bethala · Congcong Wen · Anthony Tzes · Yi Fang

[ Exhibit Hall I ]

Abstract
Policy learning focuses on devising strategies for agents in embodied AI systems to perform optimal actions based on their perceived states. One of the key challenges in policy learning involves handling complex, long-horizon tasks that require managing extensive sequences of actions and observations. Wavelet analysis offers significant advantages in signal processing, notably in decomposing signals at multiple scales to capture both global trends and fine-grained details. In this work, we introduce a novel wavelet policy learning framework that utilizes wavelet transformations to enhance policy learning. Our approach leverages multi-scale wavelet decomposition to facilitate detailed observation analysis and robust action planning over extended sequences. We detail the design and implementation of our wavelet policy, which incorporates lifting schemes for effective multi-resolution analysis and action generation. This framework is evaluated across multiple complex scenarios, including robotic manipulation and self-driving, demonstrating our method's effectiveness in improving the learned policy's precision and reliability.
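The multi-scale decomposition at the heart of the wavelet policy can be illustrated with a Haar-style lifting step, split into a predict step and an update step applied recursively along the time axis. The sketch assumes an even-length sequence and is meant only to show how lifting separates a coarse global trend from fine-grained details; it is not the paper's lifting design.

```python
import torch

def haar_lifting_step(x: torch.Tensor):
    """One lifting step of a Haar-like wavelet transform along the time axis.

    x: (B, T, D) observation or action sequence with even T.
    Returns (approx, detail), each of shape (B, T // 2, D).
    """
    even, odd = x[:, 0::2], x[:, 1::2]
    detail = odd - even              # predict step: odd samples from even neighbours
    approx = even + detail / 2       # update step: preserve the running average
    return approx, detail

def multiscale_decompose(x: torch.Tensor, levels: int = 3):
    """Recursively split the sequence into a coarse trend plus per-level details."""
    details = []
    for _ in range(levels):
        if x.shape[1] < 2:
            break
        x, d = haar_lifting_step(x)
        details.append(d)
    return x, details                # coarse global trend, fine-grained residuals

coarse, details = multiscale_decompose(torch.randn(4, 32, 16))
```

A policy head could then consume the coarse trend for long-horizon planning and the detail bands for fine-grained action corrections, which mirrors the global-versus-local motivation given in the abstract.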
Poster
Xindi Yang · Baolu Li · Yiming Zhang · Zhenfei Yin · LEI BAI · Liqian Ma · Zhiyong Wang · Jianfei Cai · Tien-Tsin Wong · Huchuan Lu · Xu Jia

[ Exhibit Hall I ]

Abstract
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to provide freedom to the VDM in generating motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods.
Poster
Akio Kodaira · Chenfeng Xu · Toshiki Hazama · Takanori Yoshimoto · Kohei Ohno · Shogo Mitsuhori · Soichi Sugano · Hanying Cho · Zhijian Liu · Masayoshi Tomizuka · Kurt Keutzer

[ Exhibit Hall I ]

Abstract
We introduce StreamDiffusion, a real-time diffusion pipeline designed for streaming image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as augmented/virtual reality, video game graphics rendering, live video streaming, and broadcasting, where high throughput is imperative. StreamDiffusion tackles this challenge through a novel pipeline-level system design. It employs unique strategies like batching the denoising process (Stream Batch), residual classifier-free guidance (R-CFG), and stochastic similarity filtering (SSF). Additionally, it seamlessly integrates advanced acceleration technologies for maximum efficiency. Specifically, Stream Batch reformulates the denoising process by eliminating the traditional wait-and-execute approach and utilizing a batching denoising approach, facilitating fluid and high-throughput streams. This results in 1.5x higher throughput compared to the conventional sequential denoising approach. R-CFG significantly addresses inefficiencies caused by repetitive computations during denoising. It optimizes the process to require minimal or no additional computations, leading to speed improvements of up to 2.05x compared to previous classifier-free methods. Besides, our stochastic similarity filtering dramatically lowers GPU activation frequency by halting computations for static image flows, achieving a remarkable reduction in computational consumption—2.39 times on an RTX 3060 …
Poster
Byungjun Kim · Shunsuke Saito · Giljoo Nam · Tomas Simon · Jason Saragih · Hanbyul Joo · Junxuan Li

[ Exhibit Hall I ]

Abstract
We present a universal prior model for 3D head avatars with hair compositionality. Existing approaches for building a generalizable prior for 3D head avatars often model face and hair in a monolithic manner, where the inherent compositionality of the human head and hair is not considered. It is especially challenging for the monolithic model to self-discover the compositionality of face and hair when the dataset is not large enough. Moreover, extending the monolithic model for applications like swapping faces or hairstyles in 3D is not straightforward. Our prior model explicitly accounts for the compositionality of face and hair, learning their priors separately. To learn disentangled latent spaces of face and hair of 3D head avatars, we propose a synthetic hairless data creation pipeline for dehairing the studio-captured dataset with estimated hairless geometry and hairless texture obtained from a diffusion prior. Using a paired dataset of hair and hairless captures, disentangled prior models for face and hair can be trained by leveraging compositionality as an inductive bias to achieve disentanglement. Our model's inherent compositionality enables a seamless transfer of face and hair components between avatars while maintaining the subject's identity. Furthermore, we demonstrate that our model can be finetuned with a monocular …
Poster
Yibin Yan · Jilan Xu · Shangzhe Di · Yikun Liu · Yudi Shi · Qirui Chen · Zeqian Li · Yifei Huang · Weidi Xie

[ Exhibit Hall I ]

Abstract
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions.To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed as **StreamFormer**, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
Poster
Yujie Wei · Shiwei Zhang · Hangjie Yuan · Biao Gong · Longxiang Tang · Xiang Wang · Haonan Qiu · Hengjia Li · Shuai Tan · Yingya Zhang · Hongming Shan

[ Exhibit Hall I ]

Abstract
Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose $\textbf{DreamRelation}$, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on …
Poster
Yiming Huang · Zhiyang Dou · Lingjie Liu

[ Exhibit Hall I ]

Abstract
Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Prior methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts, leveraging body structure-inspired inductive bias to enhance skill learning performance. Our framework features a skill modularization attention mechanism that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We further propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks. Our code will be released publicly upon publication.
Poster
Jensen Zhou · Hang Gao · Vikram Voleti · Aaryaman Vasishta · Chun-Han Yao · Mark Boss · Philip Torr · Christian Rupprecht · Varun Jampani

[ Exhibit Hall I ]

Abstract
We present $\underline{\text{S}}$tabl$\underline{\text{e}}$ $\underline{\text{V}}$irtual C$\underline{\text{a}}$mera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.
Poster
Qiaosi Yi · Shuai Li · Rongyuan Wu · Lingchen Sun · Yuhui WU · Lei Zhang

[ Exhibit Hall I ]

Abstract
Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one well-known yet critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet to preserve the diffusion prior while mitigating the increased computational cost poses new challenges. To address these issues, we propose a transfer VAE training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while preserving the pre-trained diffusion prior. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy helps align the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the overall computational cost while effectively capturing high-resolution fine-scale features. Experimental results demonstrate …
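For intuition, the two-stage TVT recipe described above can be sketched as a pair of simple training loops; the module names (`frozen_enc8`, `dec4`, `enc4`), the plain L1 loss, and the optimizer choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_tvt(frozen_enc8, dec4, enc4, loader, lr=1e-4, device="cuda"):
    """Two-stage transfer-VAE training sketch (assumed interfaces, not the paper's code)."""
    frozen_enc8.eval().requires_grad_(False)

    # Stage 1: train the new 4x decoder on features produced by the frozen
    # original VAE encoder, so the decoder stays anchored to the original latent space.
    opt = torch.optim.Adam(dec4.parameters(), lr=lr)
    for img in loader:
        img = img.to(device)
        with torch.no_grad():
            feats = frozen_enc8(img)            # features of the pre-trained encoder
        loss = F.l1_loss(dec4(feats), img)      # simple reconstruction loss (assumption)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: freeze the newly trained decoder and train the 4x encoder against it,
    # keeping the new encoder-decoder pair compatible with the preserved diffusion prior.
    dec4.eval().requires_grad_(False)
    opt = torch.optim.Adam(enc4.parameters(), lr=lr)
    for img in loader:
        img = img.to(device)
        loss = F.l1_loss(dec4(enc4(img)), img)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dec4, enc4
```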
Poster
Jian-Jian Jiang · Xiao-Ming Wu · Yi-Xiang He · Ling-An Zeng · Yilin Wei · Dandan Zhang · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to account for because cooperation is enforced from their early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost …
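As a rough illustration of the selective-interaction idea, the sketch below shows one arm computing mixing weights from its own features before fusing in the other arm's features; the module name, sigmoid gate, and residual form are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SelectiveInteraction(nn.Module):
    """Toy selective-interaction gate for one arm (interfaces are assumptions).

    Each arm keeps its own policy; this module lets an arm decide, from its own
    observation features only, how much of the other arm's features to mix in.
    """

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, own_feat, other_feat):
        # Gate weights are computed from the arm's *own* features only.
        w = self.gate(own_feat)
        return own_feat + w * other_feat
```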
Poster
Peng Wang · Yongcai Wang · Hualong Cao · Wang Chen · Deying Li

[ Exhibit Hall I ]

Abstract
This paper proposes **LA-MOTR**, a novel Tracking-by-Learnable-Association framework that resolves the competing optimization objectives between detection and association in end-to-end Tracking-by-Attention (TbA) Multi-Object Tracking. Current TbA methods employ shared decoders for simultaneous object detection and tracklet association, which often results in task interference and suboptimal accuracy. By contrast, our end-to-end framework decouples these tasks into two specialized modules: _Separated Object-Tracklet Detection (SOTD)_ and _Spatial-Guided Learnable Association (SGLA)_. This decoupled design offers flexibility and explainability. In particular, SOTD independently detects new objects and existing tracklets in each frame, while SGLA associates them via a Spatial-Weighted Learnable Attention module guided by relative spatial cues. Temporal coherence is further maintained through a Tracklet Updates Module. The learnable association mechanism resolves the inherent suboptimal association issues in decoupled frameworks, avoiding the task interference commonly observed in joint approaches. Evaluations on the DanceTrack, MOT17, and SportMOT datasets demonstrate state-of-the-art performance. Extensive ablation studies validate the effectiveness of the designed modules. Code will be publicly available.
Poster
Xincheng Shuai · Henghui Ding · Zhenyuan Qin · Hao Luo · Xingjun Ma · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate research in this field, we introduce a ***Syn****thetic* Dataset for ***F****ree-Form* ***M****otion* ***C****ontrol* (***SynFMC***). The proposed *SynFMC* dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models in learning to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on *SynFMC*, *Free-Form Motion Control* (*FMC*). *FMC* can control the 6D poses of objects and camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed *FMC* outperforms previous methods across multiple scenarios.
Poster
Panjian Huang · Saihui Hou · Junzhou Huang · Yongzhen Huang

[ Exhibit Hall I ]

Abstract
``What I cannot create, I do not understand.'' Human wisdom reveals that creation is one of the highest forms of learning. For example, Diffusion Models have demonstrated remarkable semantic structural and memory capabilities in image generation, denoising, and restoration, which intuitively benefits representation learning. However, current gait networks rarely embrace this perspective, relying primarily on learning by contrasting gait samples under varying complex conditions, leading to semantic inconsistency and uniformity issues. To address these issues, we propose Origins, a gait network with generative capabilities whose underlying philosophy is that different entities are generated from a unified template, inherently regularizing gait representations within a consistent and diverse semantic space to capture differences accurately. Admittedly, learning this unified template is exceedingly challenging, as it requires the comprehensiveness of the template to encompass gait representations under various conditions. Inspired by Diffusion Models, Origins diffuses the unified template into timestep templates for gait generative modeling, and meanwhile transfers the unified template for gait representation learning. In particular, gait generative modeling and representation learning form a unified framework for end-to-end joint training. Extensive experiments on CASIA-B, CCPG, SUSTech1K, Gait3D, GREW and CCGR-MINI demonstrate that Origins performs representation learning within a unified template, achieving superior performance.
Poster
Zheyun Qin · Deng Yu · Chuanchen Luo · Zhumin Chen

[ Exhibit Hall I ]

Abstract
In recent years, researchers have explored the task of open-vocabulary video instance segmentation, which aims to identify, track, and segment any instance within an open set of categories. The core challenge of Open-Vocabulary VIS lies in solving the cross-domain alignment problem, including spatial-temporal and text-visual domain alignments. Existing methods have made progress but still face shortcomings in addressing these alignments, especially due to data heterogeneity. Inspired by metric learning, we propose an innovative Sliced Wasserstein Bridging Learning Framework. This framework utilizes the Sliced Wasserstein distance as the core tool for metric learning, effectively bridging the four domains involved in the task. Our innovations are threefold: (1) Domain Alignment: By mapping features from different domains into a unified metric space, our method maintains temporal consistency and learns intrinsic consistent features between modalities, improving the fusion of text and visual information. (2) Weighting Mechanism: We introduce an importance weighting mechanism to enhance the discriminative ability of our method when dealing with imbalanced or significantly different data. (3) High Efficiency: Our method inherits the computational efficiency of the Sliced Wasserstein distance, allowing for online processing of large-scale video data while maintaining segmentation accuracy. Through extensive experimental evaluations, we have validated the robustness of …
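The Sliced Wasserstein distance at the core of this framework projects both feature sets onto random 1D directions and compares the sorted projections, which keeps it cheap and differentiable. A generic implementation (assuming equal-sized feature sets) might look like this:

```python
import torch

def sliced_wasserstein(x, y, n_proj=128, p=2):
    """Sliced Wasserstein distance between two equal-sized feature sets.

    x, y: (n, d) features from two domains (e.g., visual tokens and text
    embeddings). Generic sketch, not the authors' implementation.
    """
    d = x.shape[1]
    # Random unit directions defining the 1D slices.
    theta = torch.randn(n_proj, d, device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)

    # Project both sets onto every direction: (n, n_proj) each.
    x_proj = x @ theta.t()
    y_proj = y @ theta.t()

    # 1D optimal transport reduces to sorting (we assume n == m for brevity;
    # unequal sizes would require quantile interpolation).
    x_sorted, _ = torch.sort(x_proj, dim=0)
    y_sorted, _ = torch.sort(y_proj, dim=0)

    return ((x_sorted - y_sorted).abs() ** p).mean() ** (1.0 / p)
```

Because the distance is differentiable, it can be minimized directly as an alignment loss between, for example, visual and textual features.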
Poster
Yunshan Zhong · Yuyao Zhou · Yuxin Zhang · Wanchen Sui · Shen Li · Yong Li · Fei Chao · Rongrong Ji

[ Exhibit Hall I ]

Abstract
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B
Poster
Bowen Zhang · Sicheng Xu · Chuxin Wang · Jiaolong Yang · Feng Zhao · Dong Chen · Baining Guo

[ Exhibit Hall I ]

Abstract
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content.
Poster
Avihai Naaman · Ron Shapira Weber · Oren Freifeld

[ Exhibit Hall I ]

Abstract
Synchronizing multiple videos depicting the same action is straightforward when recorded from a single scene with multiple cameras, often reducible to a simple time-axis shift. However, in-the-wild scenarios and, more recently, multiple generative AI–produced videos pose a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignments. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any off-the-shelf model. TPL robustly aligns videos—whether real-world or generative—by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL offers improved synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Crucially, TPL is the first approach to mitigate out-of-sync issues for multiple generative AI videos of the same action. We will release our code upon acceptance.
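One generic way to anchor key action phases, in the spirit of the prototype sequence described above, is to align each video's frame embeddings to a shared prototype sequence with a monotonic dynamic-programming pass, avoiding exhaustive pairwise video matching. The sketch below assumes pre-extracted embeddings and at least as many frames as prototypes; it is not TPL's actual training objective.

```python
import numpy as np

def align_to_prototypes(frames, prototypes):
    """Monotonically align per-frame embeddings (T, d) to a prototype
    sequence (K, d) via dynamic time warping; returns frame->prototype labels.
    Assumes T >= K so that every prototype can be visited in order.
    """
    T, K = len(frames), len(prototypes)
    # Pairwise distances between frames and prototypes: (T, K).
    cost = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=-1)

    # DTW table: acc[t, k] = best cumulative cost of ending frame t at prototype k.
    acc = np.full((T, K), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(1, T):
        acc[t, 0] = acc[t - 1, 0] + cost[t, 0]
    for t in range(1, T):
        for k in range(1, K):
            acc[t, k] = cost[t, k] + min(acc[t - 1, k], acc[t - 1, k - 1])

    # Backtrack the monotonic frame-to-prototype assignment.
    labels = np.zeros(T, dtype=int)
    k = K - 1
    labels[T - 1] = k
    for t in range(T - 2, -1, -1):
        if k > 0 and acc[t, k - 1] <= acc[t, k]:
            k -= 1
        labels[t] = k
    return labels
```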
Poster
Yinqi Cai · Jichang Li · Zhaolun Li · Weikai Chen · Rushi Lan · Xi Xie · Xiaonan Luo · Guanbin Li

[ Exhibit Hall I ]

Abstract
Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
Poster
Liang Xu · Chengqun Yang · Zili Lin · Fei Xu · Yifan Liu · Congsheng Xu · Yiyi Zhang · Jie Qin · Xingdong Sheng · Yunhui Liu · Xin Jin · Yichao Yan · Wenjun Zeng · Xiaokang Yang

[ Exhibit Hall I ]

Abstract
Learning action models from real-world human-centric interaction datasets is important for efficiently building general-purpose intelligent assistants. However, most existing datasets offer only specialist interaction categories and ignore that AI assistants perceive and act based on first-person acquisition. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.
Poster
Fating Hong · Zunnan Xu · Zixiang Zhou · Jun Zhou · Xiu Li · Qin Lin · Qinglin Lu · Dan Xu

[ Exhibit Hall I ]

Abstract
Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.
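A toy version of the gated, mask-restricted combination of parallel control branches could look like the following; the real branches are Mamba-based, and the linear layers, softmax gate, and tensor shapes here are stand-in assumptions.

```python
import torch
import torch.nn as nn

class GatedParallelControl(nn.Module):
    """Sketch of combining several per-signal control branches with a gate.

    Each branch (audio, expression, pose, ...) outputs feature tokens of shape
    (B, N, C); region_masks[i] marks the facial tokens branch i may influence
    (the mask-drop idea). Plain linear layers stand in for the Mamba branches.
    """

    def __init__(self, dim, num_branches):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_branches))
        self.gate = nn.Linear(dim, num_branches)  # per-token branch weights

    def forward(self, x, signals, region_masks):
        # x: (B, N, C) video tokens; signals: list of (B, N, C) control features;
        # region_masks: list of (B, N, 1) binary masks.
        outs = []
        for branch, sig, mask in zip(self.branches, signals, region_masks):
            # Restrict each driving signal to its own facial region.
            outs.append(branch(x + sig) * mask)
        outs = torch.stack(outs, dim=-1)                  # (B, N, C, K)
        weights = torch.softmax(self.gate(x), dim=-1)     # (B, N, K)
        return x + (outs * weights.unsqueeze(2)).sum(-1)  # gated residual update
```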
Poster
Shiqi Huang · Shuting He · Huaiyuan Qin · Bihan Wen

[ Exhibit Hall I ]

Abstract
Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large-scale, real-world geospatial analysis.
Poster
Xiang Zhang · Yawar Siddiqui · Armen Avetisyan · Christopher Xie · Jakob Engel · Henry Howard-Jenkins

[ Exhibit Hall I ]

Abstract
We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner, and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e., vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
Poster
Fuyan Ma · Yiran He · Bin Sun · Shutao Li

[ Exhibit Hall I ]

Abstract
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the …
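The two alignment ideas in the abstract, injecting LLM-derived hard-prompt knowledge into the soft prompts and keeping prompted visual features near class prototypes, can be written as two simple feature-discrepancy terms. The sketch below uses assumed tensor shapes and plain MSE losses rather than the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_losses(soft_feats, hard_feats, visual_feats, prototypes, labels):
    """Toy losses echoing the abstract's two alignment ideas (not the paper's code).

    soft_feats:   (C, d) text features from learnable soft prompts, one per class
    hard_feats:   (C, d) text features of LLM-generated expression descriptions
    visual_feats: (B, d) prompted visual features from the frozen image encoder
    prototypes:   (C, d) class-specific visual prototypes
    labels:       (B,)   ground-truth expression labels
    """
    # Inject LLM knowledge: pull soft-prompt features toward hard-prompt features.
    l_prompt = F.mse_loss(soft_feats, hard_feats.detach())

    # Prototype-guided visual alignment: keep prompted visual features close to
    # their class prototypes to preserve CLIP's generalization ability.
    l_proto = F.mse_loss(visual_feats, prototypes[labels].detach())

    return l_prompt + l_proto
```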
Poster
Jiaxin Huang · Sheng Miao · Bangbang Yang · Yuewen Ma · Yiyi Liao

[ Exhibit Hall I ]

Abstract
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views — synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
Poster
Hao Li · Ju Dai · Feng Zhou · Kaida Ning · Lei Li · Junjun Pan

[ Exhibit Hall I ]

Abstract
While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs.
Poster
Quanwei Yang · Luying Huang · Kaisiyuan Wang · Jiazhi Guan · Shengyi He · Fengguo Li · Hang Zhou · Lingyun Yu · Yingying Li · Haocheng Feng · Hongtao Xie

[ Exhibit Hall I ]

Abstract
While increasing attention has been paid to human gesture synthesis, most previous works concentrate on holistic body movements without investigating hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system built upon a hybrid-modality diffusion transformer architecture with newly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee that these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all the counterparts.
Poster
Hao Li · Xiang Chen · Jiangxin Dong · Jinhui Tang · Jinshan Pan

[ Exhibit Hall I ]

Abstract
Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundation models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples, and higher-diversity data types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
Poster
Zeyu Wang · Jizheng Zhang · Haiyu Song · Mingyu Ge · Jiayu Wang · Haoran Duan

[ Exhibit Hall I ]

Abstract
Infrared and visible image fusion (VIS-IR) aims to integrate complementary information from both source images to produce a fused image with enriched details. However, most existing fusion models lack controllability, making it difficult to customize the fused output according to user preferences. To address this challenge, we propose a novel weakly-supervised, instance-level controllable fusion model that adaptively highlights user-specified instances based on input text. Our model consists of two stages: pseudo-label generation and fusion network training. In the first stage, guided by observed multimodal manifold priors, we leverage text and manifold similarity as joint supervisory signals to train a text-to-image response network (TIRN) in a weakly-supervised manner, enabling it to identify referenced semantic-level objects from instance segmentation outputs. To align text and image features in the TIRN, we propose a multimodal feature alignment module (MFA), using manifold similarity to guide attention weight assignment for precise correspondence between image patches and text embeddings. Moreover, we employ spatial positional relationships to accurately select the referenced instances from multiple semantic-level objects. In the second stage, the fusion network takes source images and text as input, using the generated pseudo-labels for supervision to apply distinct fusion strategies for target and non-target regions. Experimental results show that …
Poster
Youwei Zhou · Tianyang Xu · Cong Wu · Xiaojun Wu · Josef Kittler

[ Exhibit Hall I ]

Abstract
The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, most of the existing GCNs rely on the binary connection of two neighboring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. Although some studies have attempted to utilize hyper-graphs to represent the topology, they rely on a fixed construction strategy, which limits their adaptivity in uncovering the intricate latent relationships within the action. In this paper, we address this oversight and explore the merits of an adaptive hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. Specifically, we outperform the existing solutions on NTU-120, achieving 90.5\% and 91.7\% in terms of the top-1 recognition accuracy on X-Sub and …
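For reference, a standard hypergraph convolution with a learnable joint-to-hyper-edge incidence matrix (the component that can be adaptively optimized during training) is sketched below; Hyper-GCN's exact formulation, including its virtual connections, may differ.

```python
import torch
import torch.nn as nn

class AdaptiveHyperGraphConv(nn.Module):
    """Generic hypergraph convolution with a learnable incidence matrix.

    A building block consistent with the abstract's description; not the
    authors' exact layer.
    """

    def __init__(self, in_ch, out_ch, num_joints=25, num_edges=8):
        super().__init__()
        self.H = nn.Parameter(torch.rand(num_joints, num_edges))   # adaptive incidence
        self.edge_w = nn.Parameter(torch.ones(num_edges))          # hyper-edge weights
        self.theta = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        # x: (B, T, V, C) skeleton features over time T and joints V.
        H = torch.sigmoid(self.H)                            # soft joint-to-edge memberships
        Dv = H @ self.edge_w                                 # vertex degrees (V,)
        De = H.sum(dim=0)                                    # edge degrees (E,)
        # Propagation: vertices -> hyper-edges -> vertices, with degree normalization.
        A = (H * self.edge_w) / De.clamp(min=1e-6)           # (V, E)
        A = (A @ H.t()) / Dv.clamp(min=1e-6).unsqueeze(1)    # (V, V) aggregation matrix
        x = torch.einsum("uv,btvc->btuc", A, x)              # aggregate over joints
        return torch.relu(self.theta(x))
```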
Poster
Yafei Zhang · Lingqi Kong · Huafeng Li · Jie Wen

[ Exhibit Hall I ]

Abstract
To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model’s ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.
Poster
Geonhee Sim · Gyeongsik Moon

[ Exhibit Hall I ]

Abstract
Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses. Code and weights will be publicly available.
Poster
Siyuan Yan · Ming Hu · Yiwen Jiang · Xieji Li · Hao Fei · Philipp Tschandl · Harald Kittler · Zongyuan Ge

[ Exhibit Hall I ]

Abstract
The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M’s potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.
Poster
Weijie Lyu · Yi Zhou · Ming-Hsuan Yang · Zhixin Shu

[ Exhibit Hall I ]

Abstract
We present $\textit{FaceLift}$, a novel feed-forward approach for generalizable high-quality 360-degree 3D head reconstruction from a single image. Our pipeline first employs a multi-view latent diffusion model to generate consistent side and back views from a single facial input, which then feed into a transformer-based reconstructor that produces a comprehensive 3D Gaussian Splats representation. Previous methods for monocular 3D face reconstruction often lack full view coverage or view consistency due to insufficient multi-view supervision. We address this by creating a high-quality synthetic head dataset that enables consistent supervision across viewpoints. To bridge the domain gap between synthetic training data and real-world images, we propose a simple yet effective technique that ensures the view generation process maintains fidelity to the input by learning to reconstruct the input image alongside the view generation. Despite being trained exclusively on synthetic data, our method demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that $\textit{FaceLift}$ outperforms state-of-the-art 3D face reconstruction methods on identity preservation, detail recovery and rendering quality.
Poster
Youzhuo Wang · jiayi ye · Chuyang Xiao · Yiming Zhong · Heng Tao · Hang Yu · Yumeng Liu · Jingyi Yu · Yuexin Ma

[ Exhibit Hall I ]

Abstract
Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects, and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. Additionally, to ensure natural and human-like dexterous motions, we utilize teleoperation for data collection, enabling the robot's movements to align with human behaviors and habits, which is a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advancements in human-to-robot handover research by offering a high-quality dataset, effective solutions, and …
Poster
Yuang Wang · Chao Wen · Haoyu Guo · Sida Peng · Minghan Qin · Hujun Bao · Ruizhen Hu · Xiaowei Zhou

[ Exhibit Hall I ]

Abstract
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources -- human-object interactions (HOI) and dexterous robotic manipulation -- enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach.
Poster
Yan Zhang · Weiyang Liu · Alpár Cseke · Nitin Saini · Nathan Bajandas · Nicolas Heron · Michael Black

[ Exhibit Hall I ]

Abstract
To build the motor system of an interactive avatar, it is essential to develop a generative motion model that can, at a minimum, drive the body to move in 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied in the past, most methods can hardly be regarded as embodied intelligence due to their offline setting, slow speed, limited motion lengths, unnaturalness, and more. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, we let the model concentrate on learning motion dynamics from a large number of sub-second motion segments. In the adaptation phase, we propose a generic ControlNet-like adaptor and fine-tune it on semantic action generation and spatial target reaching. Experiments show that physics effects emerge in our results. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our …
Poster
Hengjia Li · Lifan Jiang · Xi Xiao · Tianyang Wang · Hongwei Yi · Boxi Wu · Deng Cai

[ Exhibit Hall I ]

Abstract
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
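Optimizing a model against the reward difference between a preferred and a rejected customized video is commonly done with a Diffusion-DPO-style objective over per-sample denoising errors. The sketch below shows that generic form, with the identity and dynamic rewards assumed to be already folded into the choice of preferred/rejected samples; it is not necessarily MagicID's exact loss.

```python
import torch
import torch.nn.functional as F

def preference_loss(err_w, err_l, err_w_ref, err_l_ref, beta=0.1):
    """Diffusion-DPO-style objective on a preferred/rejected video pair (sketch).

    err_*: per-sample denoising errors of the current model (w = preferred,
    l = rejected) and of a frozen reference model; lower error acts as a
    higher implicit reward.
    """
    # Implicit reward gap: how much more the current model favors the preferred
    # sample over the rejected one, relative to the reference model.
    logits = beta * ((err_l - err_l_ref) - (err_w - err_w_ref))
    return -F.logsigmoid(logits).mean()
```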
Poster
Weiyi You · Mingyang Zhang · Leheng Zhang · Xingyu Zhou · Kexuan Shi · Shuhang Gu

[ Exhibit Hall I ]

Abstract
Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by that of the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the …
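Consistency Training enforces that adjacent points on the same trajectory map to the same clean prediction, which is why no pre-trained teacher is needed. A minimal conditional sketch follows; conditioning on the upsampled LR image and the toy noise schedule are assumptions, not CTMSR's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_training_step(model, ema_model, lr_up, hr, sigmas, k):
    """One consistency-training step (generic sketch, assumed interfaces).

    lr_up: upsampled LR image used as conditioning, hr: ground-truth HR image,
    sigmas: decreasing noise levels, k: current discretization index.
    """
    noise = torch.randn_like(hr)
    # Two adjacent states on the same trajectory (shared noise sample).
    x_hi = hr + sigmas[k] * noise
    x_lo = hr + sigmas[k + 1] * noise

    pred = model(x_hi, sigmas[k], lr_up)
    with torch.no_grad():
        target = ema_model(x_lo, sigmas[k + 1], lr_up)  # EMA teacher, stop-gradient
    # Self-consistency: both states should map to the same HR prediction.
    return F.mse_loss(pred, target)
```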
Poster
Jeonghyeok Do · Munchurl Kim

[ Exhibit Hall I ]

Abstract
In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between the two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages our TDSM to pull correct skeleton-text matches closer while pushing apart those of different action classes. Our TDSM significantly outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
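A triplet-style objective on text-conditioned denoising errors, in the spirit of the TD loss described above, can be sketched as follows; the noising schedule, interfaces, and margin are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_diffusion_loss(denoiser, z_skel, txt_pos, txt_neg, margin=0.2):
    """Triplet-style objective on text-conditioned denoising errors (sketch).

    z_skel: (B, d) skeleton latent features; txt_pos/txt_neg: text features of
    the correct and an incorrect action class. The denoiser should reconstruct
    the noise better under the correct text guidance.
    """
    t = torch.randint(0, 1000, (z_skel.size(0),), device=z_skel.device)
    noise = torch.randn_like(z_skel)
    z_t = z_skel + noise * (t.float().unsqueeze(1) / 1000.0)   # toy noising schedule

    err_pos = F.mse_loss(denoiser(z_t, t, txt_pos), noise, reduction="none").mean(dim=1)
    err_neg = F.mse_loss(denoiser(z_t, t, txt_neg), noise, reduction="none").mean(dim=1)

    # Denoising loss for correct pairs plus a margin pushing wrong pairs apart.
    return err_pos.mean() + F.relu(margin + err_pos - err_neg).mean()
```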
Poster
Boyang Deng · Kyle Genova · Songyou Peng · Gordon Wetzstein · Noah Snavely · Leonidas Guibas · Thomas Funkhouser

[ Exhibit Hall I ]

Abstract
We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining", "overpass was painted blue", etc.).
Poster
Zhenghong Zhou · Jie An · Jiebo Luo

[ Exhibit Hall I ]

Abstract
Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and may disrupt the model's distribution learned from the training data. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the distribution learned during pretraining. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model's latent space, ensuring high-quality video generation. Latent-Reframe can be applied to both DiT- and UNet-based video diffusion models. Experimental results demonstrate that Latent-Reframe achieves camera control precision and video quality comparable or superior to training-based methods, without the need for fine-tuning on additional datasets. Please open video_results.html in the supplementary material to view the generated videos.
Poster
Mazlum Arslan · Weihong Guo · Shuo Li

[ Exhibit Hall I ]

Abstract
Traditional deep networks struggle to acquire shape-fair representations due to their high expressivity. Kolmogorov-Arnold Networks (KANs) are promising candidates as they learn nonlinearities directly, a property that makes them more adaptive. However, KANs perform suboptimally in terms of shape-fairness because of unconstrained nonlinearities, a limitation we demonstrate for the first time. On the other hand, shape-fair networks reside on a low-degree neuromanifold. Motivated by this, we investigate neuromanifold regularization of KANs to enable learning of shape-fair feature representations. The proposed method, NeuroManifold Regularized KANs (NMR-KAN), is a novel regularization that separately addresses failure modes in the acquisition of local and global shape cues. This is done by constraining the degree of the neuromanifolds of two jointly trained feature extractors. Additionally, we propose a novel Style Decorrelation Loss that promotes decorrelation of intermediate representations. Our experiments demonstrate that NMR-KAN improves shape bias over baseline convolutional KANs by 14.8\% while also providing robustness under image corruptions and adversarial attacks.
Poster
Yating Yu · Congqi Cao · Yifan Zhang · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce $\textbf{Open-MeDe}$, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve $\textbf{known-to-open generalizing}$ and $\textbf{image-to-video debiasing}$ in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, being free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.
Poster
SeungJun Moon · Hah Min Lew · Seungeun Lee · Ji-Su Kang · Gyeong-Moon Park

[ Exhibit Hall I ]

Abstract
Despite recent progress in 3D head avatar generation, balancing identity preservation, i.e., reconstruction, with novel poses and expressions, i.e., animation, remains a challenge. Existing methods struggle to adapt Gaussians to varying geometrical deviations across facial regions, resulting in suboptimal quality. To address this, we propose GeoAvatar, a framework for adaptive geometrical Gaussian Splatting. GeoAvatar leverages Adaptive Geometrical Initialization (AGI), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization. Then, based on mouth anatomy and dynamics, we introduce a novel mouth structure and the part-wise deformation strategy to enhance the animation fidelity of the mouth. Finally, we propose a regularization loss for precise rigging between Gaussians and 3DMM faces. Moreover, we release DynamicFace, a video dataset with highly expressive facial motions. Extensive experiments show the superiority of GeoAvatar compared to state-of-the-art methods in reconstruction and novel animation scenarios. The dataset and pre-trained models will be released after the review.
Poster
Shuyuan Tu · Qi Dai · Zihao Zhang · Sicheng Xie · Zhi-Qi Cheng · Chong Luo · Xintong Han · Zuxuan Wu · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a score-guided diffusion model for video motion editing. To introduce conditional controls into the denoising process, we propose two signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture (a reconstruction branch and an editing branch), significantly enhancing the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers during the score estimation. The resulting gradients thus inject appropriate guidance into the latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate MotionFollower's competitive motion editing ability both qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower delivers superior motion editing performance and exclusively supports large camera movements. To the best of our knowledge, MotionFollower is the first diffusion model to explore score regularization in video editing.
Poster
Yiyi Ma · Yuanzhi Liang · Xiu Li · Chi Zhang · Xuelong Li

[ Exhibit Hall I ]

Abstract
We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis.
Poster
Jiaxin Liu · Qichao Ying · Zhenxing Qian · Sheng Li · Runqi Zhang · Jian liu · Xinpeng Zhang

[ Exhibit Hall I ]

Abstract
The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks in that it addresses complex retouching operations of various types and degrees and focuses more on restoring the low-frequency information of faces. To tackle this challenge, we propose MoFRR, a Mixture of Diffusion Models for FRR. Inspired by DeepSeek's expert isolation strategy, MoFRR uses sparse activation of specialized experts that handle distinct retouching types, together with a shared expert that deals with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.
Poster
Chende Zheng · Ruiqi suo · Chenhao Lin · Zhengyu Zhao · Le Yang · Shuai Liu · Minghui Yang · Cong Wang · Chao Shen

[ Exhibit Hall I ]

Abstract
The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics and derive Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages these second-order temporal discrepancies. We validate the superiority of D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), covering 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39\% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness.
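As an illustration of the second-order central-difference statistic described above, here is a minimal NumPy sketch; the toy frames and the choice of summary statistic are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def second_order_central_difference(frames):
    """frames: (T, H, W[, C]) array of a video clip, values in [0, 1]."""
    # d2[t] ≈ x[t+1] - 2 * x[t] + x[t-1]: the discrete second derivative in time.
    return frames[2:] - 2.0 * frames[1:-1] + frames[:-2]

def d3_score(frames):
    """A simple summary of second-order temporal energy (placeholder statistic)."""
    d2 = second_order_central_difference(frames.astype(np.float64))
    return float(np.mean(np.abs(d2)))

# Toy example: smooth (constant-velocity) motion vs. temporally jittery frames.
t = np.linspace(0, 1, 30)[:, None, None]
smooth = np.tile(t, (1, 8, 8))                 # linear ramp -> near-zero 2nd difference
jitter = smooth + 0.05 * np.random.default_rng(0).standard_normal(smooth.shape)
print("smooth clip :", d3_score(smooth))
print("jittery clip:", d3_score(jitter))
```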
Poster
Vlad Hosu · Lorenzo Agnolucci · Daisuke Iso · Dietmar Saupe

[ Exhibit Hall I ]

Abstract
Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale at which an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study with verified reliability. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves performance compared to using only ground-truth labels. We will release the code, dataset, and pre-trained models upon acceptance.
Poster
Darshan Thaker · Abhishek Goyal · Rene Vidal

[ Exhibit Hall I ]

Abstract
Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we propose a simple modification to existing diffusion-based restoration methods that exploits the frequency structure of the reverse diffusion process. Specifically, our approach, denoted as Frequency Guided Posterior Sampling (FGPS), introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We provide the first rigorous analysis of the approximation error of FGPS for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. On real-world data, we develop an adaptive curriculum for our method's frequency schedule based on the underlying data distribution. FGPS significantly improves performance on challenging image restoration tasks including motion deblurring …
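The time-varying low-pass filtering idea can be sketched as follows; the hard circular mask and the linear cutoff schedule are illustrative assumptions, not the FGPS schedule from the paper.

```python
import numpy as np

def low_pass(measurement, keep_fraction):
    """Keep only the lowest `keep_fraction` of spatial frequencies (hard circular mask)."""
    F = np.fft.fftshift(np.fft.fft2(measurement))
    h, w = measurement.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt(yy**2 + xx**2)
    mask = radius <= keep_fraction * radius.max()
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# Placeholder schedule: early (noisy) diffusion steps see only low frequencies of the
# measurement, later steps progressively incorporate higher frequencies.
num_steps = 5
measurement = np.random.default_rng(0).standard_normal((64, 64))
for step in range(num_steps):
    keep = (step + 1) / num_steps          # 0.2, 0.4, ..., 1.0
    filtered = low_pass(measurement, keep)
    print(f"step {step}: keep {keep:.1f} of spectrum, energy {np.sum(filtered**2):.1f}")
```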
Poster
Yixing Lu · Junting Dong · YoungJoong Kwon · Qin Zhao · Bo Dai · Fernando De la Torre

[ Exhibit Hall I ]

Abstract
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image, addressing the challenging problem of single-image avatar generation. While recent methods employ diffusion models conditioned on human templates like depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between sparse driving signals and the actual human subject, resulting in multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. The dense driving signal from the initial reconstructed human provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Additionally, we propose a unified framework that enables the generalization learned from novel pose synthesis on in-the-wild videos to naturally transfer to novel view synthesis. Our video-based diffusion model enhances disentangled synthesis with high-quality view-consistent renderings for novel views and realistic non-rigid deformations in novel pose animation. Results demonstrate the superior generalization ability of our method across in-domain and out-of-domain in-the-wild datasets.
Poster
Zhi-Wei Xia · Kun-Yu Lin · Yuan-Ming Li · Wei-Jin Huang · Xian-Tuo Tan · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
This work focuses on the task of privacy-preserving action recognition, which aims to protect individual privacy in action videos without compromising recognition performance. Despite recent advancements, existing privacy-preserving action recognition models still struggle with video domain shifts. To address this challenge, this work aims to develop transferable privacy-preserving action recognition models, by leveraging labeled videos from the source domain and unlabeled videos from the target domain. This work contributes a novel method named GenPriv, which improves the transferability of privacy-preserving models by generative decoupled learning. Inspired by the fact that privacy-sensitive information in action videos primarily comes from the static human appearances, our GenPriv decouples video features into static and dynamic aspects and then removes privacy-sensitive content from static action features. We propose a generative architecture named ST-VAE, complemented by Spatial Consistency and Temporal Alignment losses, to enhance decoupled learning. Experimental results on three benchmarks with diverse domain shifts demonstrate the effectiveness of our proposed GenPriv.
Poster
Xiao Li · Qi Chen · Xiulian Peng · Kai Yu · Xie Chen · Yan Lu

[ Exhibit Hall I ]

Abstract
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking-head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
Poster
Fu-Zhao Ou · Chongyi Li · Shiqi Wang · Sam Kwong

[ Exhibit Hall I ]

Abstract
Recent advancements in Face Image Quality Assessment (FIQA) models trained on real large-scale face datasets are pivotal in guaranteeing precise face recognition in unrestricted scenarios. Regrettably, privacy concerns lead to the discontinuation of real datasets, underscoring the pressing need for a tailored synthetic dataset dedicated to the FIQA task. However, creating satisfactory synthetic datasets for FIQA is challenging. It requires not only controlling the intra-class degradation of different quality factors (e.g., pose, blur, occlusion) for the pseudo-identity generation but also designing an optimized quality characterization method for quality annotations. This paper undertakes the pioneering initiative to establish a Synthetic dataset for FIQA (SynFIQA) based on a hypothesis: accurate quality labeling can be achieved through the utilization of quality priors across the diverse domains involved in quality-controllable generation. To validate this, we tailor the generation of reference and degraded samples by aligning pseudo-identity image features in stable diffusion latent space, editing 3D facial parameters, and customizing dual text prompts and post-processing. Furthermore, we propose a novel quality characterization method that thoroughly examines the relationship of Multiple Reference representations among recognition embedding, spatial, and visual-language domains to acquire annotations essential for fitting FIQA models (MR-FIQA). Extensive experiments confirm the validity of our …
Poster
Qingyuan Liu · Ke Lv · Kun Dong · Jian Xue · Zehai Niu · Jinbao Wang

[ Exhibit Hall I ]

Abstract
Text-driven motion generation has shown notable progress in recent years. However, these works are typically limited to standardized skeletons and rely on a cumbersome retargeting process to adapt to varying skeletal configurations of diverse characters. In this paper, we present OmniSkel, a novel framework that can directly generate high-quality human motions for any user-defined skeleton without retargeting. Specifically, we introduce a skeleton-aware RVQ-VAE, which utilizes Kinematic Graph Cross Attention (K-GCA) to effectively integrate skeletal information into the motion encoding and reconstruction. Moreover, we propose a simple yet effective training-free approach, the Motion Restoration Optimizer (MRO), to ensure zero bone length error while preserving motion smoothness. To facilitate our research, we construct SkeleMotion-3D, a large-scale text-skeleton-motion dataset based on HumanML3D. Extensive experiments demonstrate the excellent robustness and generalization of our method. The dataset and source code will be made public upon acceptance of this paper.
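In the spirit of the Motion Restoration Optimizer, the sketch below shows one way to enforce zero bone-length error by rescaling each child joint along its parent direction; the chain skeleton, traversal order, and the omission of any smoothness term are assumptions for illustration, not the paper's optimizer.

```python
import numpy as np

def restore_bone_lengths(joints, parents, target_lengths):
    """Project joint positions so each bone matches its target length.

    joints:         (J, 3) joint positions of one frame
    parents:        list of parent indices (-1 for the root)
    target_lengths: dict {joint_index: length to its parent}
    """
    fixed = joints.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root joint has no bone to fix
        direction = fixed[j] - fixed[p]
        norm = np.linalg.norm(direction) + 1e-8
        fixed[j] = fixed[p] + direction / norm * target_lengths[j]
    return fixed

# Toy 4-joint chain (root -> spine -> neck -> head) with unit target bones.
parents = [-1, 0, 1, 2]
target = {1: 1.0, 2: 1.0, 3: 1.0}
noisy = np.array([[0, 0, 0], [0, 1.3, 0], [0, 2.1, 0.2], [0, 3.4, 0.1]], dtype=float)
restored = restore_bone_lengths(noisy, parents, target)
for j, p in enumerate(parents):
    if p >= 0:
        print(f"bone {p}->{j}: length {np.linalg.norm(restored[j] - restored[p]):.3f}")
```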
Poster
Jiazheng Liu · Zejin Wang · Bohao Chen · Hua Han

[ Exhibit Hall I ]

Abstract
Self-supervised blind denoising for Poisson-Gaussian noise remains a challenging task. Pseudo-supervised pairs constructed from single noisy images re-corrupt the signal and degrade performance. Visible blind spots address the information loss of masked inputs. However, without explicit noise sensing, a mean-squared-error objective cannot adjust the denoising intensity to dynamic noise levels, leading to noticeable residual noise. In this paper, we propose Blind2Sound, a simple yet effective approach to overcome residual noise in denoised images. The proposed adaptive re-visible loss senses noise levels and performs personalized denoising without noise residues while keeping the signal lossless. A theoretical analysis of intermediate medium gradients guarantees stable training, while the Cramer Gaussian loss acts as a regularizer that facilitates accurate perception of noise levels and improves the performance of the denoiser. Experiments on synthetic and real-world datasets show the superior performance of our method, especially for single-channel images. The code is publicly available from this link.
Poster
Xiaohang Yang · Qing Wang · Jiahao Yang · Gregory Slabaugh · Shanxin Yuan

[ Exhibit Hall I ]

Abstract
Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless \textbf{S}patial-\textbf{T}emporal \textbf{a}ware motion \textbf{R}etargeting (\textbf{STaR}), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code and model will be released upon acceptance.
Poster
Ying Guo · Xi Liu · Cheng Zhen · Pengfei Yan · Xiaoming Wei

[ Exhibit Hall I ]

Abstract
Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigms or explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging for them to be real-time and realistic. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the …
Poster
Wenbo Yang · Zhongling Wang · Zhou Wang

[ Exhibit Hall I ]

Abstract
Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific degradation or a narrow set of degradations, and often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the *first* universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model’s accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video (anonymized version available in the supplementary material), code, and dataset of this project will be released.
Poster
Wanpeng Zhang · Yicheng Feng · Hao Luo · Yijiang Li · Zihao Yue · Sipeng Zheng · Zongqing Lu

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.
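A toy sketch of byte-pair encoding over visual token sequences is given below; it uses pair frequency only (the paper's priority-guided scheme also considers spatial consistency, which is omitted here), and the token ids are placeholders.

```python
from collections import Counter

def most_frequent_pair(sequences):
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)   # replace the adjacent pair with a new merged token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy "visual token" sequences (e.g. raster-ordered VQ indices of image patches).
sequences = [[3, 7, 7, 2, 3, 7], [7, 2, 3, 7, 7, 2]]
next_id = 100  # ids below 100 are assumed to be base visual tokens
for _ in range(2):  # two BPE merges
    pair = most_frequent_pair(sequences)
    print("merging", pair, "->", next_id)
    sequences = [merge_pair(s, pair, next_id) for s in sequences]
    next_id += 1
print("encoded sequences:", sequences)
```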
Poster
Chen Li · Chinthani Sugandhika · Ee Yeo Keat · Eric Peh · Hao Zhang · HONG YANG · Deepu Rajan · Basura Fernando

[ Exhibit Hall I ]

Abstract
Existing human motion Q\&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q\&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets.
Poster
Qi Bi · Yixian Shen · Jingjun Yi · Gui-Song Xia

[ Exhibit Hall I ]

Abstract
A Vision Foundation Model (VFM) provides an inherent generalization ability to unseen domains for downstream tasks. However, fine-tuning a VFM to parse various adverse scenes (e.g., fog, snow, night) is particularly challenging, as such samples are difficult to collect and annotate. Using easy-to-acquire clear scenes as the source domain is a feasible solution, but a huge domain gap exists between clear and adverse scenes due to their dramatically different appearance. In this paper, we propose \texttt{AdaDCP} to effectively fine-tune a VFM for adverse scene segmentation by generalizing only from a clear source domain. Interestingly, after a discrete cosine transform, the frequency bands of a VFM exhibit either variant or invariant properties across various adverse weather conditions. Therefore, our \texttt{AdaDCP} is empowered by three key components: (1) weather-invariant band adaptation, which provides a foundation for enhancing robustness to adverse scenes; (2) weather-variant band adaptation, which perceives the weather-specific information of each type of adverse scene; (3) weather-invariant band alignment, which implicitly enforces the weather-variant bands to progressively incorporate weather-invariant information, thereby mitigating the clear-to-adverse domain gap. Experiments conducted on eight unseen adverse scene segmentation datasets show its state-of-the-art performance.
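The band-wise decomposition that such adapters would operate on can be sketched with a discrete cosine transform as below; the low/high split fraction is an arbitrary assumption, since the abstract does not specify how bands are grouped.

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_dct_bands(feature_map, low_fraction=0.25):
    """Split a 2D feature map into low- and high-frequency components via DCT."""
    coeffs = dctn(feature_map, norm="ortho")
    h, w = feature_map.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * low_fraction), : int(w * low_fraction)] = 1.0  # low-frequency block
    low = idctn(coeffs * mask, norm="ortho")
    high = idctn(coeffs * (1.0 - mask), norm="ortho")
    return low, high

feat = np.random.default_rng(0).standard_normal((32, 32))
low, high = split_dct_bands(feat)
# The two bands are complementary, so they sum back to the original feature map.
print("reconstruction error:", np.abs((low + high) - feat).max())
```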
Poster
Zhehui Wu · Yong Chen · Naoto Yokoya · Wei He

[ Exhibit Hall I ]

Abstract
Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks.
Poster
Tianfang Zhu · Hongyang Zhou · Anan LI

[ Exhibit Hall I ]

Abstract
Capturing the spatial patterns of neurons and generating high-fidelity morphological data remain critical challenges in developing biologically realistic large-scale brain network models. Existing methods fail to reconcile anatomical complexity with diversity and computational scalability. We propose MorphoGen, a hierarchical framework integrating global structure prediction through denoising diffusion probabilistic models (DDPMs) with local neurite optimization. The pipeline initiates with DDPM-generated coarse-grained neuronal point clouds, followed by skeletonization and growth-guided linking to derive plausible tree-like structures, and culminates in natural neural fiber refinement via a pragmatic smoothing network. Comprehensive evaluations across three distinct long-range projection neuron datasets demonstrate that the proposed method improves 1-Nearest Neighbor Accuracy by approximately 12\% on average compared to the state-of-the-art baseline, reduces average training time by around 55\%, and aligns the distributions of several morphometrics with real data. This work establishes a novel global-to-local paradigm for neuronal morphology generation, offering a more direct and efficient approach compared to current branch-sequential modeling methods. Code is provided in the supplementary materials and will be publicly available upon acceptance.
Poster
Xiaojie Zhang · Yuanfei Wang · Ruihai Wu · Kunqi Xu · Yu Li · Liuyu Xiang · Hao Dong · Zhaofeng He

[ Exhibit Hall I ]

Abstract
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose \textbf{AdaRPG}, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG’s strong generalization ability across novel articulated object categories.
Poster
Yuang Feng · Shuyong Gao · Fuzhen Yan · Yicheng Song · Lingyi Hong · Junjie Hu · Wenqiang Zhang

[ Exhibit Hall I ]

Abstract
Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by the human memory-recognition mechanism, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the \textbf{Scoring, Remember, and Reference (SRR)} framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10\% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.
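A toy sketch of score-based reference-frame selection is shown below; the decoder that produces (mask, score) pairs is mocked, and the memory size is an arbitrary choice rather than the SRR configuration.

```python
import heapq
import random

def dummy_decoder(frame):
    """Stand-in for the dual-purpose decoder: returns (predicted_mask, confidence_score)."""
    return f"mask_of_{frame}", random.random()

def process_video(frames, memory_size=3):
    memory = []  # min-heap of (score, frame): cheap to drop the weakest reference
    for t, frame in enumerate(frames):
        mask, score = dummy_decoder(frame)
        references = [f for _, f in memory]  # frames that would guide attention at step t
        print(f"t={t}: score={score:.2f}, references={references}")
        heapq.heappush(memory, (score, frame))
        if len(memory) > memory_size:
            heapq.heappop(memory)  # keep only the top-scoring reference frames
    return memory

random.seed(0)
process_video([f"frame{t}" for t in range(6)])
```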
Poster
Zhengyuan Peng · Jianqing Xu · Yuge Huang · Jinkun Hao · Shouhong Ding · zhizhong zhang · Xin TAN · Lizhuang Ma

[ Exhibit Hall I ]

Abstract
Stylized face recognition is the task of recognizing generated faces with the same ID across diverse stylistic domains (e.g., anime, painting, cyberpunk styles). This emerging field plays a vital role in the governance of generative images, serving a primary objective: recognizing the ID information of stylized faces to detect potential infringements of portrait rights. Despite its importance, progress in stylized face recognition has been hindered by the lack of large-scale, stylistically diverse datasets. To address this gap, we introduce the \textbf{Stylized-Face} dataset, which is the first dataset specifically designed for stylized face recognition. The Stylized-Face dataset includes 4.6 million images across 62k IDs, specifically curated to enhance model performance in stylized face recognition tasks. To ensure data quality (i.e., ID preservation) at this massive scale, we implement a semi-automated pipeline for large-scale data cleaning. Based on the Stylized-Face dataset, we establish three benchmarks to evaluate the robustness and generalization of recognition models across various scenarios, including within-distribution performance, cross-prompt generalization, and cross-method generalization, which target key challenges in stylized face recognition. Experimental results demonstrate that models trained on Stylized-Face achieve remarkable improvements in both stylized face recognition performance (a 15.9% improvement in TAR at FAR=1e-4) and generalization (a 13.3% improvement in …
Poster
Shivangi Aneja · Artem Sevastopolsky · Tobias Kirschstein · Justus Thies · Angela Dai · Matthias Nießner

[ Exhibit Hall I ]

Abstract
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photorealistic and personalized multi-view consistent 3D human head avatars from spoken audio at real-time rendering rates. To capture the expressive and detailed nature of human heads, including skin furrowing and fine facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create photorealistic and temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize dynamic facial details at real-time rendering rates. Next, we devise an audio-conditioned transformer model to extract lip and wrinkle features from the audio input and combine them with our 3D avatar by performing joint 3D sequence refinement to synthesize photorealistic animations. To the best of our knowledge, this is the first work to generate photorealistic multi-view 3D head avatar sequences only from spoken audio, representing a significant advancement in the field of audio-driven 3D facial animation. In the absence of a high-quality multi-view talking-face dataset, we captured a new large-scale multi-view dataset of audio-visual sequences of native English speakers with diverse facial geometry. GaussianSpeech achieves state-of-the-art quality consistent with the avatar's speaking style.
Poster
Jie Zhu · Yiyang Su · Minchul Kim · Anil Jain · Xiaoming Liu

[ Exhibit Hall I ]

Abstract
Whole-body biometric recognition is a challenging multi-modal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present $\textbf{Q}$uality-guided $\textbf{M}$ixture of score-fusion $\textbf{E}$xperts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in multi-modal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality. Code will be publicly released upon publication.
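A minimal sketch of quality-guided fusion of per-modality similarity matrices is given below; the small gating network and mocked quality inputs are placeholders for the paper's quality estimators and mixture-of-experts design.

```python
import torch
import torch.nn as nn

class QualityGatedFusion(nn.Module):
    """Fuse per-modality similarity matrices with weights predicted from quality estimates."""
    def __init__(self, num_modalities):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(num_modalities, 16), nn.ReLU(),
                                  nn.Linear(16, num_modalities))

    def forward(self, similarities, qualities):
        # similarities: (M, P, G) similarity matrices for M modalities (probe x gallery)
        # qualities:    (P, M) per-probe quality estimate for each modality
        weights = torch.softmax(self.gate(qualities), dim=-1)      # (P, M)
        weights = weights.permute(1, 0).unsqueeze(-1)              # (M, P, 1)
        return (weights * similarities).sum(dim=0)                 # (P, G) fused scores

M, P, G = 3, 4, 10   # e.g. face / gait / body similarities, 4 probes, 10 gallery subjects
fusion = QualityGatedFusion(M)
fused = fusion(torch.randn(M, P, G), torch.rand(P, M))
print(fused.shape)  # torch.Size([4, 10])
```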
Poster
Xinran Ling · Chen Zhu · Meiqi Wu · Hangyu Li · Xiaokun Feng · Cundian Yang · Aiming Hao · Jiashu Zhu · Jiahong Wu · Xiangxiang Chu

[ Exhibit Hall I ]

Abstract
Video generation has advanced rapidly, driving improvements in evaluation methods, yet assessing the motion of generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perception; 2) existing motion prompts are limited. Based on these findings, we introduce VMBench—a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3\% improvement in Spearman’s correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench as an open-source benchmark, setting a new standard for …
Poster
Haonan He · Yufeng Zheng · Jie Song

[ Exhibit Hall I ]

Abstract
Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking the hand and face separately fails to capture their relative poses. To overcome this, we propose to combine a depth-order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. …
Poster
Elena Buglakova · Anwai Archit · Edoardo D'Imprima · Julia Mahamid · Constantin Pape · Anna Kreshuk

[ Exhibit Hall I ]

Abstract
Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
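The root cause described above can be illustrated with a toy NumPy example: per-tile normalization statistics create a seam in an otherwise smooth signal, whereas statistics shared across tiles (as fixed, BatchRenorm-style statistics would be at inference) do not. The tile layout and values are illustrative assumptions.

```python
import numpy as np

# A smooth horizontal ramp: neighbouring pixels differ only slightly.
image = np.tile(np.linspace(0.0, 10.0, 64), (64, 1))

def normalize(tile, mean, std):
    return (tile - mean) / (std + 1e-6)

left, right = image[:, :32], image[:, 32:]

# Per-tile statistics (what an InstanceNorm-style layer computes inside each window):
seam = abs(normalize(left, left.mean(), left.std())[:, -1].mean()
           - normalize(right, right.mean(), right.std())[:, 0].mean())
print("seam jump with per-tile stats:", round(seam, 3))   # large artificial discontinuity

# Fixed statistics shared by all tiles (BatchRenorm-style behaviour at inference):
mu, sigma = image.mean(), image.std()
seam = abs(normalize(left, mu, sigma)[:, -1].mean()
           - normalize(right, mu, sigma)[:, 0].mean())
print("seam jump with fixed stats   :", round(seam, 3))   # ~0, no tiling artifact
```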
Poster
Hanyuan Liu · Chengze Li · Minshan Xie · Wang Zhenni · Jiawen Liang · Chi LEUNG · Tien-Tsin Wong

[ Exhibit Hall I ]

Abstract
While digitally acquired photographs have dominated since around 2000, a huge number of legacy photographs were acquired by optical cameras and are stored in the form of negative films. In this paper, we focus on the unique phenomenon of deterioration of negative films and propose the first high-quality 35mm negative film dataset, BlueNeg, for restoring channel-heterogeneous deterioration. We would like to bring attention to this under-explored research area of image restoration for channel-heterogeneous deterioration. However, a large portion of the collected negative films are already contaminated, so we do not have non-corrupted versions or the ground truth of these photos, which poses a challenge in evaluating restoration performance. To address this, we leverage the printed photos from the same negative films, which do not suffer from the channel-heterogeneous deterioration, for quantitative evaluation. We propose a reverse-developing process to generate an estimated ground truth from the printed photos and design an evaluation protocol for assessing restoration performance. With the collected data and the proposed evaluation protocol, we find that existing image restoration methods cannot perform well on our dataset, requiring specially designed tools for better restoration. We hope that our dataset and benchmark will inspire future research …
Poster
Junyu Shi · Lijiang LIU · Yong Sun · Zhiyuan Zhang · JINNI ZHOU · Qiang Nie

[ Exhibit Hall I ]

Abstract
Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment through a text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment them with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
Poster
Seongmin Park · Hyungmin Kim · Sangwoo kim · Wonseok Jeon · Juyoung Yang · Byeongwook Jeon · Yoonseon Oh · Jungwook Choi

[ Exhibit Hall I ]

Abstract
Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning, which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, our method preserves decision fidelity under low-bit precision. We validate its generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5$\times$ speedup and 2.5$\times$ energy savings on an edge GPU with minimal accuracy loss. These results underline our method’s potential for efficiently deploying large IL-based policy models on resource-limited devices.
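A minimal sketch of saliency-aware loss weighting is given below; the small policy network, the mocked saliency scores, and the simple weighting rule are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Stand-in for a (possibly quantized) policy network.
policy = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def saliency_weighted_bc_loss(states, expert_actions, saliency, alpha=2.0):
    """Behaviour-cloning loss where mission-critical (high-saliency) states get extra weight."""
    per_sample = nn.functional.mse_loss(policy(states), expert_actions,
                                        reduction="none").mean(dim=1)
    weights = 1.0 + alpha * saliency            # placeholder weighting rule
    return (weights * per_sample).mean()

states = torch.randn(64, 10)
expert_actions = torch.randn(64, 4)
saliency = torch.rand(64)                       # mocked saliency scores in [0, 1]
loss = saliency_weighted_bc_loss(states, expert_actions, saliency)
loss.backward()
optimizer.step()
print("weighted imitation loss:", loss.item())
```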
Poster
Sungwoo Cho · Jeongsoo Choi · Sungnyun Kim · Se-Young Yun

[ Exhibit Hall I ]

Abstract
Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating detailed speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.
Poster
Thomas Carr · Depeng Xu · Shuhan Yuan · Aidong Lu

[ Exhibit Hall I ]

Abstract
Capturing and visualizing motion using skeleton-based techniques is a key aspect of computer vision, particularly in virtual reality (VR) settings. Its popularity has surged, driven by the simplicity of obtaining skeleton data and the growing appetite for virtual interaction. Although this skeleton data appears to be non-identifiable, it can be exploited to derive personally identifiable information (PII), posing a risk of inadvertent privacy breaches. In this paper, we explore the application of motion retargeting and its ability to mitigate privacy leakages. Motion retargeting can effectively transfer the motion from an initial user onto a dummy skeleton with the purpose of hiding PII. We propose a Privacy-centric Deep Motion Retargeting model (PMR), which mitigates the PII through adversarial learning. In our evaluation, our proposed model achieves motion retargeting performance on par with the current state-of-the-art models. More importantly, it effectively prevents the attackers from identifying the initial user.
Poster
Junyoung Lim · Jaewoo Ahn · Gunhee Kim

[ Exhibit Hall I ]

Abstract
Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernable data from the chart and employ a cycle consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirm that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing not only open-source and proprietary models but also even human-annotated captions.
Poster
Kelin Yu · Sheng Zhang · Harshit Soora · Furong Huang · Heng Huang · Pratap Tokekar · Ruohan Gao

[ Exhibit Hall I ]

Abstract
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GenFlowRL, which derives shaped rewards from generated flow trained on easy-to-collect cross-embodiment datasets. This enables learning generalizable and robust policies from expert demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios.
Poster
Congyi Fan · Jian Guan · Xuanjia Zhao · Dongli Xu · Youtian Lin · Tong Ye · Pengming Feng · Haiwei Pan

[ Exhibit Hall I ]

Abstract
Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages a gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce a Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity.
Poster
Tingting Zheng · Hongxun Yao · Kui Jiang · Yi Xiao · Sicheng Zhao

[ Exhibit Hall I ]

Abstract
Recent advances in selective state space models (Mamba) have shown great promise in whole slide image (WSI) classification. Despite this, WSIs contain explicit local redundancy (similar patches) and irrelevant regions (uninformative instances), posing significant challenges for Mamba-based multi-instance learning (MIL) methods in capturing global representations. Furthermore, bag-level approaches struggle to extract critical features from all instances, while group-level methods fail to adequately account for tumor dispersion and intrinsic correlations across groups, leading to suboptimal global representations. To address these issues, we propose group masking Mamba (GMMamba), a novel framework that combines two elaborate modules: (1) intra-group masking Mamba (IMM) for selective instance exploration within groups, and (2) cross-group super-feature sampling (CSS) to ameliorate long-range relation learning. Specifically, IMM adaptively predicts sparse masks to filter out features with low attention scores (i.e., uninformative patterns) during bidirectional Mamba modeling, facilitating the removal of instance redundancies for compact local representation. For improved bag prediction, the CSS module further aggregates sparse group representations into discriminative features, effectively grasping comprehensive dependencies among dispersed and sparse tumor regions inherent in large-scale WSIs. Extensive experiments on four datasets demonstrate that GMMamba outperforms the state-of-the-art ACMIL by 2.2\% and 6.4\% in accuracy on the TCGA-BRCA and TCGA-ESCA datasets, …
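A toy sketch of intra-group masking is shown below: within one group of patch features, instances with low (here, non-top-k) attention scores are dropped before sequence modeling; the scoring and keep ratio are placeholders, not the predicted sparse masks of the paper.

```python
import torch

def intra_group_mask(group_features, attn_scores, keep_ratio=0.5):
    """Keep only the most informative instances inside one group of WSI patches.

    group_features: (N, D) patch features of one group
    attn_scores:    (N,)   attention scores (placeholder for a learned sparse mask)
    """
    k = max(1, int(keep_ratio * group_features.shape[0]))
    keep_idx = torch.topk(attn_scores, k).indices.sort().values  # preserve sequence order
    return group_features[keep_idx], keep_idx

torch.manual_seed(0)
group = torch.randn(8, 16)               # 8 patches, 16-dim features
scores = torch.rand(8)                   # mocked attention scores
kept, idx = intra_group_mask(group, scores)
print("kept instances:", idx.tolist(), "->", kept.shape)  # 4 of 8 patches survive
```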
Poster
Sindhu Hegde · K R Prajwal · Taein Kwon · Andrew Zisserman

[ Exhibit Hall I ]

Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. All code, models, and data annotations will be released to support future research.
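A minimal sketch of the global phrase contrastive term (symmetric InfoNCE between gesture and phrase embeddings) is given below; the encoders are mocked with random features, and the local gesture-word coupling loss is omitted.

```python
import torch
import torch.nn.functional as F

def global_phrase_contrastive_loss(gesture_emb, phrase_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired gesture/phrase embeddings."""
    g = F.normalize(gesture_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    logits = g @ p.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(g.shape[0])               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

torch.manual_seed(0)
gesture_emb = torch.randn(16, 256)   # mocked outputs of the video/gesture encoder
phrase_emb = torch.randn(16, 256)    # mocked outputs of the text/speech encoder
print("contrastive loss:", global_phrase_contrastive_loss(gesture_emb, phrase_emb).item())
```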
Poster
Inwoo Hwang · Jinseok Bae · Donggeun Lim · Young Min Kim

[ Exhibit Hall I ]

Abstract
Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector position for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. The shared second stage can then synthesize a natural whole-body motion that precisely satisfies the task requirement from dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios.
Poster
Yan Wu · Korrawe Karunratanakul · Zhengyi Luo · Siyu Tang

[ Exhibit Hall I ]

Abstract
Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.
Poster
Kun Li · pengyu Liu · Dan Guo · Fei Wang · zhiliang wu · Hehe Fan · Meng Wang

[ Exhibit Hall I ]

Abstract
Human body actions are an important form of non-verbal communication in social interactions. This paper specifically focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding.
Poster
Mo Zhou · Keren Ye · Mauricio Delbracio · Peyman Milanfar · Vishal Patel · Hossein Talebi

[ Exhibit Hall I ]

Abstract
Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors; however, generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which are frequently seen in the wild. A simple yet flexible diffusion-based framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gains, especially for images with complex degradations.
Poster
Chun-Han Yao · Yiming Xie · Vikram Voleti · Huaizu Jiang · Varun Jampani

[ Exhibit Hall I ]

Abstract
We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing the quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gains by SV4D 2.0 both visually and quantitatively, achieving better detail (-14\% LPIPS) and 4D consistency (-44\% FV4D) in novel-view video synthesis and 4D optimization (-12\% LPIPS and -24\% FV4D) compared to SV4D.
Poster
Zengbin Wang · Saihui Hou · Junjie Li · Xu Liu · Chunshui Cao · Yongzhen Huang · Siye Wang · Man Zhang

[ Exhibit Hall I ]

Abstract
Modality exploration in gait recognition has been repeatedly mentioned as a core research topic, evolving from binary silhouette to some promising modalities like parsing, mesh, point clouds, etc. These latest modalities agree that silhouette is less affected by background and clothing noise, but argue it loses too much valuable discriminative information. They seek to retain the strengths of silhouette while extracting more semantic or structural information through upstream estimation for better recognition. We agree with this principle but argue that these upstream estimations are usually unstable and the resulting modalities rely on pre-defined designs. Moreover, the crucial aspect of modality generalization remains underexplored. To address this, inspired by the stability and high-dimension analysis in frequency decomposition, we propose Gait-X to explore how to flexibly and stably develop a gait-specific generalized X modality from a frequency perspective. Specifically, 1) We replace upstream estimation with stable frequency decomposition and conduct a comprehensive analysis of how different frequencies impact modality and within-/cross-domain performance; 2) To enable flexible modality customization and mitigate the influence of noise and domain variations, we propose to remove irrelevant low-frequency noise and suppress high-frequency domain-specific information to form our X modality; 3) To further improve model generalization, we expand …
Poster
Taiga Yamane · Ryo Masumura · Satoshi Suzuki · Shota Orihashi

[ Exhibit Hall I ]

Abstract
Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird's eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces trajectory motion cost and trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs calculate which pedestrians at the current and each past timestamp are likely identical based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that current pedestrian with the correct past trajectory based on other past timestamps. In addition, MVTrajecter effectively captures the relationships between multiple timestamps leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms the previous state-of-the-art methods.
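A simplified reading of the trajectory motion and appearance costs is a cost matrix averaged over several past timestamps; the weights and distance choices below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of combining motion and appearance costs over K past timestamps.
import numpy as np

def association_cost(cur_pos, past_pos, cur_feat, past_feat, w_motion=1.0, w_app=1.0):
    """
    cur_pos:  (N, 2)    current detections on the BEV occupancy map
    past_pos: (K, M, 2) positions of M tracked pedestrians at K past timestamps
    cur_feat: (N, C), past_feat: (K, M, C) L2-normalized appearance embeddings
    Returns an (N, M) association cost matrix averaged over the K past timestamps.
    """
    motion = np.linalg.norm(cur_pos[None, :, None, :] - past_pos[:, None, :, :], axis=-1)  # (K, N, M)
    app = 1.0 - np.einsum('nc,kmc->knm', cur_feat, past_feat)                              # cosine distance
    return (w_motion * motion + w_app * app).mean(axis=0)                                  # (N, M)
```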
Poster
Youliang Zhang · Ronghui Li · Yachao Zhang · Liang Pan · Jingbo Wang · Yebin Liu · Xiu Li

[ Exhibit Hall I ]

Abstract
Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to flawed motion clips in video-based motion capture results and the inherent complexity of modeling high-difficulty motions. Therefore, leveraging the strength of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video masks to repair flawed motions, and propose a physics-based motion transfer module (PTM), which employs a prior-injected pretrain-and-adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine video motion capture, and it also excels in motion generation tasks. Finally, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public datasets. Our project page is: https://physicalmotionrestoration.github.io/
Poster
Xiaobao Wei · Peng Chen · Guangyu Li · Ming Lu · Hui Chen · Feng Tian

[ Exhibit Hall I ]

Abstract
Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use neural radiance fields (NeRF) to generate augmented data. However, existing methods based on NeRF are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined in head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, the first high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. Leveraging the unstructured nature of 3DGS, we develop a novel representation of the eye for rigid eye rotation based on the target gaze direction. To enable synthesis generalization across various subjects, we integrate an expression-guided module to inject subject-specific information into the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. The code will be released.
Poster
Zijian Dong · Longteng Duan · Jie Song · Michael Black · Andreas Geiger

[ Exhibit Hall I ]

Abstract
We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.
Poster
Yujie Zhou · Jiazi Bu · Pengyang Ling · Pan Zhang · Tong Wu · Qidong Huang · Jinsong Li · Xiaoyi Dong · Yuhang Zang · Yuhang Cao · Anyi Rao · Jiaqi Wang · Li Niu

[ Exhibit Hall I ]

Abstract
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers of the image relighting model to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video’s appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted videos while maintaining the relighted image quality, ensuring coherent lighting transitions across frames.
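The progressive linear blending between source and relighted appearance can be pictured in image space as below; the actual method operates inside the diffusion process, so the simple per-step schedule here is only an assumption.

```python
# Toy sketch of progressively blending source appearance toward the relighted target.
import torch

def progressive_light_fusion(src_frames, relit_frames, num_steps):
    """src_frames / relit_frames: (T, C, H, W); yields one blended target per step."""
    for step in range(1, num_steps + 1):
        w = step / num_steps                        # weight on the relit appearance grows each step
        yield (1.0 - w) * src_frames + w * relit_frames
```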
Poster
Heyi Sun · Cong Wang · Tian-Xing Xu · Jingwei Huang · Di Kang · Chunchao Guo · Song-Hai Zhang

[ Exhibit Hall I ]

Abstract
Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is …
Poster
Ke Fan · Shunlin Lu · Minyue Dai · Runyi Yu · Lixing Xiao · Zhiyang Dou · Junting Dong · Lizhuang Ma · Jingbo Wang

[ Exhibit Hall I ]

Abstract
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion—the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation.
Poster
Ziyu Guo · Young-Yoon Lee · Joseph Liu · Yizhak Ben-Shabat · Victor Zordan · Mubbasir Kapadia

[ Exhibit Hall I ]

Abstract
We present **SᴛʏʟᴇMᴏᴛɪғ**, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, SᴛʏʟᴇMᴏᴛɪғ seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from *multi-modal* inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance.
Poster
Zhiyuan Zhang · Dongdong Chen · Jing Liao

[ Exhibit Hall I ]

Abstract
We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.
Poster
Seungjin Jung · Kanghee Lee · Yonghyun Jeong · Haeun Noh · Jungmin Lee · Jongwon Choi

[ Exhibit Hall I ]

Abstract
Domain Generalizable Face Anti-Spoofing (DG-FAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DG-FAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
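A minimal sketch of the orthogonal decomposition step, assuming a hypothetical learned `domain_basis`: features are split into a projection onto the domain subspace and an orthogonal, domain-invariant residual. QR is used here as a stand-in for Gram-Schmidt (both yield an orthonormal basis).

```python
# Hedged sketch of a Gram-Schmidt-style split into domain-specific and domain-invariant parts.
import torch

def orthogonal_decompose(feats, domain_basis):
    """feats: (N, C) features; domain_basis: (K, C) hypothetical domain-specific directions, K < C."""
    Q, _ = torch.linalg.qr(domain_basis.T)      # (C, K) orthonormal columns spanning the domain subspace
    domain_specific = feats @ Q @ Q.T           # projection onto the domain subspace
    domain_invariant = feats - domain_specific  # orthogonal residual
    return domain_invariant, domain_specific
```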
Poster
Zhaolun Li · Jichang Li · Yinqi Cai · Junye Chen · Xiaonan Luo · Guanbin Li · Rushi Lan

[ Exhibit Hall I ]

Abstract
In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
Poster
Xiaokun Sun · Zeyu Cai · Ying Tai · Jian Yang · Zhenyu Zhang

[ Exhibit Hall I ]

Abstract
While a haircut conveys distinct personality, existing avatar generation methods fail to model practical hair due to data limitations or entangled representations. We propose StrandHead, a novel text-driven method capable of generating 3D hair strands and disentangled head avatars with strand-level attributes. Instead of using large-scale hair-text paired data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative models pre-trained on human mesh data. To this end, we propose a meshing approach guided by strand geometry to guarantee the gradient flow from the distillation objective to the neural strand representation. The optimization is then regularized by statistically significant haircut features, leading to stable updating of strands against unreasonable drifting. These employed 2D/3D human-centric priors contribute to text-aligned and realistic 3D strand generation. Extensive experiments show that StrandHead achieves state-of-the-art performance on text-to-strand generation and disentangled 3D head avatar modeling. The generated 3D hair can be applied to avatars for strand-level editing, as well as implemented in the graphics engine for physical simulation or other applications.
Poster
Ruining Li · Chuanxia Zheng · Christian Rupprecht · Andrea Vedaldi

[ Exhibit Hall I ]

Abstract
We present Puppet-Master, a video generator designed to capture the internal, part-level motion dynamics of objects as a proxy to understand object dynamics universally. Given an image of an object and a set of “drags” specifying the trajectory of a few points of the object, Puppet-Master synthesizes a video where the object parts move accordingly. We extend a pre-trained image-to-video generator with a module that encodes the input drags, and introduce all-to-first attention, a novel alternative to conventional spatial attention that mitigates artifacts caused by fine-tuning a video generator on out-of-domain data. Instead of using real videos, which often intertwine part-level motion with overall object motion, camera movement, and occlusion, we fine-tune Puppet-Master on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips obtained by rendering synthetic 3D animations. We extensively filter out sub-optimal animations and augment the synthetic renderings with meaningful drags to emphasize the internal dynamics of objects. We demonstrate that by using this synthetic dataset, Puppet-Master learns to generate part-level motions, unlike other motion-conditioned video generators that mostly move the object as a whole, and generalizes well to real images, outperforming existing methods on real-world benchmarks in a zero-shot manner.
Poster
Hao He · Ceyuan Yang · Shanchuan Lin · Yinghao Xu · Meng Wei · Liangke Gui · Qi Zhao · Gordon Wetzstein · Lu Jiang · Hongsheng Li

[ Exhibit Hall I ]

Abstract
This paper introduces CameraCtrl II, a framework that enables continuous and dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera motion. We take an approach that progressively expands the generation of dynamic scenes---first enhancing dynamic content within individual clips, then extending these capabilities to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera annotation for training while designing a lightweight camera injection module and training scheme to enhance dynamics from pretrained models. Building on these improved single-clip capabilities, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables dynamic scene synthesis with substantially wider spatial exploration and enhanced dynamics than previous approaches. We will release the dataset and code.
Poster
Lingyi Hong · Jinglun Li · Xinyu Zhou · Shilin Yan · Pinxue Guo · Kaixun Jiang · Zhaoyu Chen · Shuyong Gao · Runze Li · Xingdong Sheng · Wei Zhang · Hong Lu · Wenqiang Zhang

[ Exhibit Hall I ]

Abstract
Previous works have attempted to improve tracking efficiency through lightweight architecture design or knowledge distillation from teacher models to compact student trackers. However, these solutions often sacrifice accuracy for speed to a great extent, and also suffer from complex training processes and structural limitations. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages to break the limitation of model structure. Additionally, we also design a unique replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior and simplifies the training process. To further encourage the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model's compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of …
Poster
Runqi Wang · Yang Chen · Sijie Xu · Tianyao He · Wei Zhu · Dejia Song · Nemo Chen · Xu Tang · Yao Hu

[ Exhibit Hall I ]

Abstract
Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion models and plug-and-play adaptive attention layers for image and video face swapping. First, we introduce four fine-grained facial conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Our framework seamlessly adapts to both image and video domains. Our code and results will be available on the project page: https://dynamic-face.github.io/.
Poster
Xin Ding · Hao Wu · Yifan Yang · Shiqi Jiang · Qianxi Zhang · Donglin Bai · Zhibo Chen · Ting Cao

[ Exhibit Hall I ]

Abstract
With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To solve the key challenge of the contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, the LLM is only invoked when relevant events occur. To realize event feature extraction with constant cost, we propose the Event-Preserving Feature Extractor (EPFE) based on a state-space method, generating a single perception token for spatiotemporal features. These techniques enable the video LLM with full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI Copilot and interactive media.
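The event-gated interleaving can be pictured as a per-frame loop in which only the encoder and gate run every frame, while the expensive LLM is invoked only when the gate fires; `encoder`, `gate`, and `llm` below are hypothetical callables, not the released interfaces.

```python
# Conceptual sketch of event-gated LLM invocation for a streaming video dialogue loop.
def stream_dialogue(frames, encoder, gate, llm, threshold=0.5):
    responses, perception_tokens = [], []
    for t, frame in enumerate(frames):
        token = encoder(frame)              # constant-cost per-frame perception token
        perception_tokens.append(token)
        if gate(token) > threshold:         # gate fires only when a relevant event occurs
            responses.append((t, llm(perception_tokens)))
    return responses
```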
Poster
Zhefei Gong · Pengxiang Ding · Shangke Lyu · Siteng Huang · Mingyang Sun · Wei Zhao · Zhaoxin Fan · Donglin Wang

[ Exhibit Hall I ]

Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce **C**oarse-to-**F**ine **A**uto**R**egressive Policy (**CARP**), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers **10×** faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
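A conceptual sketch of next-scale autoregression follows: each step predicts the tokens of the next, finer scale conditioned on all coarser scales; `transformer` and `decode_actions` are hypothetical placeholders and the scale sizes are arbitrary, not CARP's configuration.

```python
# Hedged sketch of coarse-to-fine, next-scale autoregressive generation of action tokens.
import torch

@torch.no_grad()
def next_scale_generate(transformer, decode_actions, obs_emb, scales=(1, 4, 16, 64)):
    """obs_emb: (B, L0, D) observation embedding used as the initial context."""
    tokens_per_scale = []
    for num_tokens in scales:
        context = torch.cat([obs_emb] + tokens_per_scale, dim=1)        # all coarser scales as prefix
        next_tokens = transformer(context, num_new_tokens=num_tokens)   # hypothetical call
        tokens_per_scale.append(next_tokens)
    return decode_actions(tokens_per_scale[-1])                         # finest scale -> action sequence
```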
Poster
Yunqi Miao · Zhiyu Qu · Mingqi Gao · Changrui Chen · Jifei Song · Jungong Han · Jiankang Deng

[ Exhibit Hall I ]

Abstract
Although diffusion prior is rising as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with no or few degradations, while BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate the complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to address specific gaps. In restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In degradation mode, the model synthesizes real-world-like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion-prior-based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling real-world degradations.
Poster
Yi-Ting Chen · Ting-Hsuan Liao · Pengsheng Guo · Alex Schwing · Jia-Bin Huang

[ Exhibit Hall I ]

Abstract
We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit unifying 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling while maintaining structural consistency in 3D reconstructions. Code will be released.
Poster
Rongtao Xu · Jian Zhang · Minghao Guo · Youpeng Wen · Haoting Yang · Min Lin · Jianzheng Huang · Zhe Li · Kaidong Zhang · Liqiong Wang · Yuxuan Kuang · Meng Cao · Feng Zheng · Xiaodan Liang

[ Exhibit Hall I ]

Abstract
Robotic manipulation faces critical challenges in understanding spatial affordances—the "where" and "how" of object interactions—essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A₀, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A₀ leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A₀ is pre-trained on 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model’s output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman and Dobot) demonstrate A₀'s superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
Poster
Shehreen Azad · Yogesh Rawat

[ Exhibit Hall I ]

Abstract
In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometric feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce **DisenQ** (**Disen**tangling **Q**-Former), a unified querying transformer that disentangles biometric, motion, and non-biometric features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.
Poster
Song Wang · Xie Han · Liqun Kuang · Boying Wang · Zhongyu Chen · Zherui Qiao · Fan Yang · Xiaoxia Liu · Bingyu Zhang · Zhixun Wang

[ Exhibit Hall I ]

Abstract
Infrared and visible image fusion (IVF) aims to generate informative fused images by combining the merits of different modalities. In this paper, we uncover the inherent "attention properties" of infrared images, which directly arise from their physical characteristics and can be linked to attention mechanisms naturally, as observed in the gradient-weighted class activation mapping (Grad-CAM) visualization results of image classification models. To incorporate this property into IVF for better fusion, we propose the source infrared cross attention (I-SCA). Furthermore, we extend this discovery to visible images and introduce the source visible cross attention (V-SCA). The joint use of I-SCA and V-SCA addresses longstanding issues in image fusion, such as insufficient and incomplete multimodal feature interaction and fusion. Moreover, to solve the problem of mismatched channel numbers between the source images and intermediate features, which makes it impossible to apply the attention equation directly, and to minimize the domain gap between their respective feature spaces, an adaptive channel boosting and intelligent space mapping module (CBSM) is introduced. Specifically, we treat the CBSM-processed raw image as the query, while the intermediate features of another modality are treated as keys and values in I-SCA and V-SCA. Unlike attention mechanisms that divide images into …
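The source cross-attention idea (the raw image of one modality as the query, intermediate features of the other modality as keys and values) can be sketched as below; the CBSM module is reduced here to a 1x1 convolution purely for channel matching, which is an assumption rather than the paper's design, and the source image is assumed to be resized to the feature resolution.

```python
# Hedged sketch of source-image cross attention (I-SCA / V-SCA style query-key-value split).
import torch
import torch.nn as nn

class SourceCrossAttention(nn.Module):
    def __init__(self, in_ch, feat_dim, num_heads=4):
        super().__init__()
        self.cbsm = nn.Conv2d(in_ch, feat_dim, kernel_size=1)    # stand-in for the CBSM module
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, source_img, other_feat):
        """source_img: (B, in_ch, H, W) resized to match other_feat; other_feat: (B, feat_dim, H, W)."""
        B, _, H, W = other_feat.shape
        q = self.cbsm(source_img).flatten(2).transpose(1, 2)     # (B, HW, feat_dim) query from raw image
        kv = other_feat.flatten(2).transpose(1, 2)               # keys/values from the other modality
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, -1, H, W)
```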
Poster
Zimin Ran · Xingyu Ren · Xiang An · Kaicheng Yang · Ziyong Feng · Jing Yang · Rolandos Alexandros Potamias · Linchao Zhu · Jiankang Deng

[ Exhibit Hall I ]

Abstract
Recent 3D facial reconstruction methods have made significant progress in shape estimation, but high-fidelity unbiased facial albedo estimation remains challenging. Existing methods rely on expensive light-stage captured data, and while they have made progress in either high-fidelity reconstruction or unbiased skin tone estimation, no work has yet achieved optimal results in both aspects simultaneously. In this paper, we present a novel high-fidelity unbiased facial diffuse albedo reconstruction method, HUST, which recovers the diffuse albedo map directly from a single image without the need for captured data. Our key insight is that the albedo map is the illumination-invariant texture map, which enables us to use inexpensive texture data for diffuse albedo estimation by eliminating illumination. To achieve this, we collect large-scale high-resolution facial images and train a VQGAN model in the image space. To adapt the pre-trained VQGAN model for UV texture generation, we fine-tune the encoder by using limited UV textures and our high-resolution faces under adversarial supervision in both image and latent space. Finally, we train a cross-attention module and utilize group identity loss for the domain adaptation from texture to albedo. Extensive experiments demonstrate that HUST can predict high-fidelity facial albedos for in-the-wild images. On the FAIR benchmark, …
Poster
Yangyi Huang · Ye Yuan · Xueting Li · Jan Kautz · Umar Iqbal

[ Exhibit Hall I ]

Abstract
Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.
Poster
Pulkit Kumar · Shuaiyi Huang · Matthew Walmer · Sai Saketh Rambhatla · Abhinav Shrivastava

[ Exhibit Hall I ]

Abstract
Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym.
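A plain-NumPy sketch of a Histogram of Oriented Displacements descriptor for a single trajectory is given below; the bin count and magnitude weighting are illustrative choices rather than the paper's exact recipe.

```python
# Simple sketch of a Histogram of Oriented Displacements (HoD) for one tracked point.
import numpy as np

def histogram_of_oriented_displacements(track, num_bins=8):
    """track: (T, 2) array of (x, y) point positions over T frames."""
    disp = np.diff(track, axis=0)                       # (T-1, 2) frame-to-frame displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])         # displacement orientations in [-pi, pi]
    mags = np.linalg.norm(disp, axis=1)
    hist, _ = np.histogram(angles, bins=num_bins, range=(-np.pi, np.pi), weights=mags)
    return hist / (hist.sum() + 1e-8)                   # normalized motion descriptor
```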
Poster
zijie wu · Chaohui Yu · Fan Wang · Xiang Bai

[ Exhibit Hall I ]

Abstract
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
Poster
Kanggeon Lee · Soochahn Lee · Kyoung Mu Lee

[ Exhibit Hall I ]

Abstract
Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.
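The coarse-to-fine auto-regressive refinement loop might look roughly like the following, where `predict_residual` is a hypothetical stand-in for the network (multi-scale features plus cross-attention guidance) described above, and the transformation is simplified to a dense flow field.

```python
# Rough sketch of coarse-to-fine auto-regressive refinement of a dense transformation field.
import torch
import torch.nn.functional as F

def autoregressive_align(predict_residual, src, tgt, num_scales=3, iters_per_scale=2):
    """src, tgt: (B, C, H, W). Returns a dense flow estimate at full resolution."""
    B, _, H, W = src.shape
    flow = None
    for s in reversed(range(num_scales)):                     # coarse -> fine
        size = (H // 2 ** s, W // 2 ** s)
        src_s = F.interpolate(src, size=size, mode='bilinear', align_corners=False)
        tgt_s = F.interpolate(tgt, size=size, mode='bilinear', align_corners=False)
        if flow is None:
            flow = torch.zeros(B, 2, *size, device=src.device)
        else:                                                 # upsample and rescale the previous estimate
            flow = 2.0 * F.interpolate(flow, size=size, mode='bilinear', align_corners=False)
        for _ in range(iters_per_scale):                      # auto-regressive residual updates
            flow = flow + predict_residual(src_s, tgt_s, flow)
    return flow
```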
Poster
Chih-Hao Lin · Zian Wang · Ruofan Liang · Yuxuan Zhang · Sanja Fidler · Shenlong Wang · Zan Gojcic

[ Exhibit Hall I ]

Abstract
Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects---including rain, snow, fog, and clouds---directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.
Poster
Wangze Xu · Yifan Zhan · Zhihang Zhong · Xiao Sun

[ Exhibit Hall I ]

Abstract
The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, with the recently popular 3DGS technique enabling real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex mapping from pose to appearance during fitting. In this paper, we excavate the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of the proposed motion conditions, we adopt a spatio-temporal multi-scale sampling strategy to hierarchically integrate more motion clues to model human avatars. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, all while delivering performance that is at least comparable or even superior.
Poster
Yinda Chen · Haoyuan Shi · Xiaoyu Liu · Te Shi · Ruobing Zhang · Dong Liu · Zhiwei Xiong · Feng Wu

[ Exhibit Hall I ]

Abstract
Neuron segmentation from electron microscopy (EM) volumes is crucial for understanding brain circuits, yet the complex neuronal structures in high-resolution EM images present significant challenges. Inspired by autoregressive pretraining in language models, we propose TokenUnify, a hierarchical predictive coding framework that captures multi-scale dependencies through complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space with emergent properties. From an information-theoretic perspective, these three tasks are complementary and provide optimal coverage of visual data structure. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity. Leveraging the Mamba architecture's linear-time sequence modeling capabilities, TokenUnify achieves a 45\% performance improvement on downstream neuron segmentation and outperforms MAE by 21\%. Our approach demonstrates superior scaling properties as model size increases, effectively bridging the gap between pretraining strategies for language and vision models.
Poster
Shuangrui Ding · Rui Qian · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Yuwei Guo · Dahua Lin · Jiaqi Wang

[ Exhibit Hall I ]

Abstract
The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the "error accumulation" problem, where an erroneous or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 toward complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust toward …
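The constrained tree search over pathways resembles a beam search on cumulative mask scores; the sketch below assumes a hypothetical `propose_masks(frame, history)` hook that returns candidate (mask, score) pairs for each existing pathway.

```python
# Schematic sketch of maintaining a fixed number of segmentation pathways by cumulative score.
import heapq

def track_with_pathways(frames, propose_masks, num_pathways=3):
    pathways = [([], 0.0)]                                   # (mask history, cumulative score)
    for frame in frames:
        candidates = []
        for history, score in pathways:
            for mask, mask_score in propose_masks(frame, history):
                candidates.append((history + [mask], score + mask_score))
        # Keep the same fixed number of highest-scoring branches as the new pathways.
        pathways = heapq.nlargest(num_pathways, candidates, key=lambda p: p[1])
    return max(pathways, key=lambda p: p[1])[0]              # best pathway's mask sequence
```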
Poster
Jiahao Luo · Chaoyang Wang · Michael Vasilkovsky · Vladislav Shakhrai · Di Liu · Peiye Zhuang · Sergey Tulyakov · Peter Wonka · Hsin-Ying Lee · James Davis · Jian Wang

[ Exhibit Hall I ]

Abstract
We propose a new framework to create high-quality character head morphable models from text, combining static text-to-3D generation with video diffusion. Bridging the gap between these two methods is challenging: text-to-3D models produce detailed static geometry but cannot synthesize motion, while video diffusion models generate motion but face consistency issues like varying colors, varying viewpoints, or geometric distortion. Our solution uses deformable 3D Gaussian splatting to align static 3D models with video diffusion outputs, enabling the creation of a set of diverse, expressive motions with greater accuracy. By incorporating static geometry as a constraint and using a view-dependent deformation MLP, we reduce video artifacts and produce coherent, consistent results. This approach allows us to build a 3D morphable model that can generate new, realistic expressions. Compared to existing 4D generation techniques, our method achieves superior results and creates expressive character head models that can be animated.
Poster
Ruijie Lu · Yixin Chen · Yu Liu · Jiaxiang Tang · Junfeng Ni · Diwen Wan · Gang Zeng · Siyuan Huang

[ Exhibit Hall I ]

Abstract
Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to …
Poster
Shijie Fang · Hongping Gan

[ Exhibit Hall I ]

Abstract
Deep Unfolding Networks (DUNs) have emerged as a powerful framework for pansharpening due to their interpretable fusion strategies. However, existing DUNs are limited by their serial iterative architectures, which hinder cross-stage and cross-modal feature interactions at different abstraction levels. This limitation results in insufficient integration of multi-level multimodal features and compromised reconstruction accuracy. To address these challenges, we propose the Unfolding-Associative Encoder-Decoder Network (UED-Net), an innovative framework that iteratively extracts multi-level cross-modal degradation encodings and recursively refines features for cross-stage adaptive aggregation decoding through lightweight processes. Specifically, we first introduce the spatial-spectral encoding module, which progressively and interpretably perceives the hierarchical degradation encoding features of both space and spectrum. Moreover, we develop the unfolding-associative attention module to capture pixel-level attention across stages, thereby leveraging the causal relationships of multi-level features for aggregation during decoding. Meanwhile, we implement a progressive alignment mechanism, which coordinates both feature distribution and alignment of spatial and spectral modalities between iterative stages to facilitate adaptive fusion. These modules enable UED-Net to achieve efficient pansharpening by aggregating multi-level features. Extensive qualitative and quantitative experiments confirm the superiority of UED-Net.
Poster
Guanxing Lu · Tengbo Yu · Haoyuan Deng · Season Chen · Yansong Tang · Ziwei Wang

[ Exhibit Hall I ]

Abstract
Performing general language-conditioned bimanual manipulation tasks is of great importance for many applications ranging from household service to industrial assembly. However, collecting bimanual manipulation data is expensive due to the high-dimensional action space, which poses challenges for conventional methods to handle general bimanual manipulation tasks. In contrast, unimanual policy has recently demonstrated impressive generalizability across a wide range of tasks because of scaled model parameters and training data, which can provide sharable manipulation knowledge for bimanual systems. To this end, we propose a plug-and-play method named **AnyBimanual**, which transfers pretrained unimanual policy to general bimanual manipulation policy with few bimanual demonstrations. Specifically, we first introduce a skill manager to dynamically schedule the skill representations discovered from pretrained unimanual policy for bimanual manipulation tasks, which linearly combines skill primitives with task-oriented compensation to represent the bimanual manipulation instruction. To mitigate the observation discrepancy between unimanual and bimanual systems, we present a visual aligner to generate soft masks for visual embedding of the workspace, which aims to align visual input of unimanual policy model for each arm with those during pretraining stage. AnyBimanual shows superiority on 12 simulated tasks from RLBench2 with a sizable 17.33% improvement in success rate over previous methods. …
Poster
Tao Wang · Peiwen Xia · Bo Li · Peng-Tao Jiang · Zhe Kong · Kaihao Zhang · Tong Lu · Wenhan Luo

[ Exhibit Hall I ]

Abstract
Adverse weather conditions, such as rain, snow, and haze, introduce complex degradations that present substantial challenges for effective image restoration. Existing all-in-one models often rely on fixed network structures, limiting their ability to adapt to the varying characteristics of different weather conditions. Moreover, these models typically lack the iterative refinement process that human experts use for progressive image restoration. In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions. Our method incorporates two core types of experts, i.e., channel-wise modulation and spatial modulation experts to address task-specific degradation characteristics while minimizing task interference. In addition, inspired by human expertise, we frame the optimization process as a sequential, progressive problem, allowing the network to refine its parameters progressively and adapt to specific weather conditions. Extensive experiments demonstrate the efficacy and superiority of our proposed method. The code and pre-trained models will be available.
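A toy PyTorch sketch of mixing channel-wise and spatial modulation experts with a learned router is shown below; the layer sizes, the router, and the soft mixing rule are illustrative assumptions rather than MOERL's actual architecture (which is additionally optimized with reinforcement learning).

```python
# Hedged sketch: a mixture of channel-wise and spatial modulation experts with soft routing.
import torch
import torch.nn as nn

class ChannelExpert(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)                       # channel-wise modulation

class SpatialExpert(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return x * self.conv(x)                     # spatial modulation

class ModulationMoE(nn.Module):
    def __init__(self, c, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [ChannelExpert(c) if i % 2 == 0 else SpatialExpert(c) for i in range(num_experts)])
        self.router = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, num_experts))
    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)           # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * outs).sum(dim=1)
```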
Poster
Li Huaqiu · Yong Wang · Tongwen Huang · Hailang Huang · Haoqian Wang · Xiangxiang Chu

[ Exhibit Hall I ]

Abstract
Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates a multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generation preference of the diffusion model, and employs recurrent refinement for posterior sampling. The proposed method enables zero-shot unified image restoration without the need for any prior knowledge of specific task types and degradation modeling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be made publicly available.
Poster
Wenqiang Sun · Shuo Chen · Fangfu Liu · Zilong Chen · Yueqi Duan · Jun Zhu · Jun Zhang · Yikai Wang

[ Exhibit Hall I ]

Abstract
In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to poor spatial and temporal controllability during generation. To overcome this difficulty, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware directors from dimension-variant data. This decoupled video diffusion enables precise manipulation of spatial structures and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames by combining spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation, respectively. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves state-of-the-art performance in decoupled video generation, as well as 3D and 4D scene generation.
Poster
Feng yan · Fanfan Liu · Yiyang Huang · ZechaoGuan ZechaoGuan · Liming Zheng · Yufeng Zhong · Chengjian Feng · Lin Ma

[ Exhibit Hall I ]

Abstract
In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model, \textit{RoboMM}, along with the comprehensive dataset, \textit{RoboData}. \textit{RoboMM} enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. \textit{RoboData} offers a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions, with space alignment that facilitates comprehensive learning from diverse robotic datasets. Equipped with \textit{RoboData} and the unified physical space, \textit{RoboMM} is the first generalist policy to surpass expert models, enabling simultaneous evaluation of all tasks across multiple datasets rather than being limited to specific data or task selections. Its design significantly enhances robotic manipulation performance, increasing the average sequence length on the CALVIN benchmark from 1.7 to 3.5 and ensuring cross-embodiment capabilities, achieving state-of-the-art results across multiple datasets, including both simulated and real-world data.
Poster
Seunghun Lee · Jiwan Seo · Minwoo Choi · Kiljoon Han · Jaehoon Jeong · Zane Durante · Ehsan Adeli · Sang Hyun Park · Sunghoon Im

[ Exhibit Hall I ]

Abstract
In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability throughout the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves a state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos.
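The memory idea can be illustrated with a small data structure that keeps only the most recent embedding of each instance and overwrites it when a presence score clears a threshold. The class below is a minimal sketch under that assumption; the interface, threshold, and cosine matching are illustrative and not the paper's actual module.

```python
import torch

class LatestObjectMemory:
    """Minimal sketch (assumed interface): keep the most recent embedding
    of each tracked instance, updated only when the instance is judged
    present in the current frame."""
    def __init__(self, presence_thresh: float = 0.5):
        self.mem = {}                       # instance id -> latest embedding
        self.presence_thresh = presence_thresh

    def update(self, ids, embeddings, presence_scores):
        for i, emb, p in zip(ids, embeddings, presence_scores):
            if p >= self.presence_thresh:   # only overwrite when present
                self.mem[i] = emb.detach()

    def match(self, query_emb):
        """Assign a query embedding to the closest stored instance."""
        if not self.mem:
            return None
        ids = list(self.mem.keys())
        bank = torch.stack([self.mem[i] for i in ids])
        sims = torch.nn.functional.cosine_similarity(query_emb[None], bank)
        return ids[int(sims.argmax())]
```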
Poster
Haiyang Liu · Zhan Xu · Fating Hong · Hsin-Ping Huang · Yi Zhou · Yang Zhou

[ Exhibit Hall I ]

Abstract
We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and poses. These designs ensure both high video texture quality and accurate motion trajectories. Our Video Motion Graphs outperforms existing generative- and retrieval-based methods for human motion video generation. Our code and pretrained models are publicly available.
Poster
Hyung Rok Jung · Daneul Kim · Seunggyun Lim · Jeany Son · Jonghyun Choi

[ Exhibit Hall I ]

Abstract
Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods rely on complete video frames for prediction, which contrasts with the human ability to process information online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, $\textit{ESTimator}$, inspired by Event Segmentation Theory (EST) which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA), and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD computes the discrepancy between the prediction and the actual incoming frame, adaptively adjusting the error threshold using statistical tests on historical errors to capture diverse and subtle event transitions. Experimental results demonstrate that $ESTimator$ outperforms all baselines adapted from …
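The boundary test can be pictured as flagging frames whose prediction error is a statistical outlier relative to recent history. The sketch below uses a simple z-score test over a sliding window as one assumed instantiation of the statistical test on historical errors mentioned above; the window size and threshold are placeholders.

```python
import numpy as np

def online_boundary_discriminator(errors, window=32, z_thresh=2.0):
    """Minimal sketch of the boundary-test idea: flag a boundary when the
    current prediction error is a statistical outlier w.r.t. recent errors.
    The z-score test and parameter values are illustrative assumptions."""
    boundaries, history = [], []
    for t, e in enumerate(errors):
        if len(history) >= window:
            mu = np.mean(history[-window:])
            sigma = np.std(history[-window:]) + 1e-6
            if (e - mu) / sigma > z_thresh:   # error spike -> event boundary
                boundaries.append(t)
        history.append(e)
    return boundaries

# toy usage: a spike at t=50 is detected as a boundary
errs = np.concatenate([np.random.rand(50) * 0.1, [1.0], np.random.rand(49) * 0.1])
print(online_boundary_discriminator(errs))
```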
Poster
lijiayi jiayi

[ Exhibit Hall I ]

Abstract
Language-conditioned robot manipulation in the continuous spectrum presents a persistent challenge due to the difficulty of mapping states to target actions. Previous methods face limitations in effectively modeling object states, primarily due to their reliance on executing ambiguous instructions devoid of explicit state information. In response, we present SD$^2$Actor, a zero-shot robotic manipulation framework that possesses the capability to generate precise actions in continuous states. Specifically, given novel instructions, we aim to generate instruction-following and accurate robot manipulation actions. Instead of time-consuming optimization and finetuning, our zero-shot method generalizes to any object state with a wide range of translations and versatile rotations. At its core, we quantify multiple base states in the training set and utilize their combination to refine the target action generated by the diffusion model. To obtain novel state representations, we initially employ LLMs to extract the novel state from the instruction and decompose it into multiple learned base states. We then employ a linear combination of base state embeddings to produce novel state features. Moreover, we introduce an orthogonalization loss to constrain the state embedding space, which ensures the validity of linear interpolation. Experiments demonstrate that SD$^2$Actor outperforms state-of-the-art methods across a diverse range of …
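The orthogonalization constraint on the base state embeddings can be written as a penalty on the Gram matrix of the normalized embeddings. The sketch below shows one assumed Frobenius-norm form of that loss together with the linear combination used to form a novel state feature; exact weighting and dimensions are illustrative.

```python
import torch

def orthogonalization_loss(base_state_emb: torch.Tensor) -> torch.Tensor:
    """Minimal sketch: encourage base state embeddings (K, D) to be mutually
    orthogonal so that linear interpolation between them stays well behaved.
    The Frobenius form below is an assumed instantiation of the loss."""
    e = torch.nn.functional.normalize(base_state_emb, dim=-1)
    gram = e @ e.t()                                   # (K, K)
    identity = torch.eye(e.size(0), device=e.device)
    return ((gram - identity) ** 2).sum()

def novel_state_feature(base_state_emb, coeffs):
    """Novel state as a linear combination of learned base states."""
    return coeffs @ base_state_emb                     # (D,)

base = torch.randn(4, 128, requires_grad=True)
loss = orthogonalization_loss(base)
feat = novel_state_feature(base, torch.tensor([0.5, 0.3, 0.2, 0.0]))
```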
Poster
Xiangyue Zhang · Jianfang Li · Jiaxu Zhang · Ziqiang Dang · Jianqiang Ren · Liefeng Bo · Zhigang Tu

[ Exhibit Hall I ]

Abstract
High-quality co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn base motions and sparse motions, and then adaptively fuse them. In particular, a coarse-to-fine cross-attention module and rhythmic consistency learning are explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
Poster
Haitao Tian

[ Exhibit Hall I ]

Abstract
In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, “Shuffle and Warp”, which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting the relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset, where it demonstrates a significant boost over state-of-the-art methods in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
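A minimal sketch of the “Shuffle and Warp” augmentation is given below: the same set of trimmed clips is concatenated in two different random orders, with each clip randomly time-warped. The warp range and return format are assumptions for illustration; the real pipeline and the CPC/ROR losses that consume these two views are not shown.

```python
import numpy as np

def shuffle_and_warp(trimmed_clips, rng=np.random.default_rng(0)):
    """Minimal sketch of the "Shuffle and Warp" idea: build two multi-action
    sequences from the same trimmed (single-action) skeleton clips by
    permuting their order and randomly time-warping each clip."""
    def build(perm):
        seq, labels = [], []
        for idx in perm:
            clip = trimmed_clips[idx]                      # (T, J, C) skeleton clip
            factor = rng.uniform(0.5, 1.5)                 # assumed temporal warp range
            t_new = max(2, int(len(clip) * factor))
            src = np.linspace(0, len(clip) - 1, t_new).astype(int)
            seq.append(clip[src])
            labels.extend([idx] * t_new)                   # frame-level action labels
        return np.concatenate(seq), np.array(labels)
    perm_a = rng.permutation(len(trimmed_clips))
    perm_b = rng.permutation(len(trimmed_clips))
    return build(perm_a), build(perm_b)   # two permuted views for CPC / ROR

clips = [np.random.randn(40 + 5 * i, 25, 3) for i in range(4)]
(view_a, lab_a), (view_b, lab_b) = shuffle_and_warp(clips)
```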
Poster
Martin de La Gorce · Charlie Hewitt · Tibor Takács · Robert Gerdisch · Zafiirah Hosenie · Givi Meishvili · Marek Kowalski · Thomas J. Cashman · Antonio Criminisi

[ Exhibit Hall I ]

Abstract
Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method …
Poster
Zhijing Sun · Senyan Xu · Kean Liu · Runze Tian · Xueyang Fu · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Existing event-based video deblurring methods face limitations in extracting and fusing long-range spatiotemporal motion information from events, primarily due to restricted receptive fields or low computational efficiency, resulting in suboptimal deblurring performance. To address these issues, we introduce the state space model, which leverages linear complexity and global receptive fields for long-range modeling, and propose EVDM, a novel Event-based Video Deblurring framework with Mamba. The framework consists of: (1) Motion Clue Extraction Mamba (MCEM), which employs an event self-reconstruction loss to ensure the completeness of details when extracting long-range motion information. (2) Motion-aware Intra-frame Fusion Mamba (MIFM) and Inter-frame Temporal Propagation Mamba (ITPM), which utilize the motion-aware state space to perform cross-modal fusion and inter-frame information exchange guided by motion clues. Consequently, EVDM achieves superior detail restoration in blurred regions while ensuring temporal motion consistency across frames. Additionally, to overcome the limitation of fixed exposure ratios in existing event-frame paired datasets, we introduce T-RED, a high-quality, high-resolution dataset with varying exposure time ratios. T-RED provides more realistic and complex data for event-based video deblurring research. Experiments on multiple datasets demonstrate that EVDM outperforms previous SOTA methods.
Poster
Sen Wang · Shao Zeng · Tianjun Gu · zhizhong zhang · Ruixin Zhang · Shouhong Ding · Jingyun Zhang · Jun Wang · Xin TAN · Yuan Xie · Lizhuang Ma

[ Exhibit Hall I ]

Abstract
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
Poster
Yuansheng Li · Yunhao Zou · Linwei Chen · Ying Fu

[ Exhibit Hall I ]

Abstract
Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement: 1) the lack of training datasets, and 2) the difficulty of eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method.
Poster
Ashutosh Anshul · Shreyas Gopal · Deepu Rajan · Eng Chng

[ Exhibit Hall I ]

Abstract
Recent deepfake detection algorithms focus solely on uni-modal or cross-modal inconsistencies. While the former disregards audio-visual correspondence entirely, rendering it less effective against multimodal attacks, the latter overlooks inconsistencies within a particular modality. Moreover, many models are single-stage supervised frameworks, effective on specific training data but less generalizable to new manipulations. To address these gaps, we propose a two-stage multimodal framework that first learns intra-modal and cross-modal temporal synchronization on real videos, capturing audio-visual correspondences crucial for deepfake detection and localization. We introduce a Gaussian-targeted loss in our pretraining model to focus on learning relative synchronization patterns across multimodal pairs. Using pretrained features, our approach not only enables classification on fully manipulated videos but also supports a localization module for partial deepfakes with only specific segments spoofed. Moreover, the pretraining stage does not require fine-tuning, thus reducing complexity. Our model, tested on various benchmark datasets, demonstrates strong generalization and precise temporal localization.
Poster
Bizhu Wu · Jinheng Xie · Meidan Ding · Zhe Kong · Jianfeng Ren · Ruibin Bai · Rong Qu · Linlin Shen

[ Exhibit Hall I ]

Abstract
Generating realistic human motions from given textual descriptions has undergone significant advancements owing to the prevalence of digital humans. Although recent studies have achieved notable success in this task, they omit specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442k human motion snippets, short segments of human motion sequences, and their corresponding detailed human body part movement descriptions. Additionally, the dataset includes about 95k detailed paragraphs describing the movements of human body parts throughout entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3\% improvement in Top-3 accuracy for the MDM network. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. The dataset and code will be released on GitHub.
Poster
gaojie lin · Jianwen Jiang · Jiaqi Yang · Zerong Zheng · Chao Liang · ZHANG YUAN · Jingtu Li

[ Exhibit Hall I ]

Abstract
End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up like large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals).
Poster
Luming Zhao · Jingwen Xuan · Jiamin Lou · Yonghui Yu · Wenwu Yang

[ Exhibit Hall I ]

Abstract
Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues—such as whether a student is looking at a phone or reading a book—alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial …
Poster
Ziyan Guo · Zeyu HU · Na Zhao · De Wen Soh

[ Exhibit Hall I ]

Abstract
Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities and fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce 1) the MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization …
Poster
Jiawei Liang · Siyuan Liang · Tianrui Lou · Ming Zhang · liwenjin liwenjin · Dunqiu fan · Xiaochun Cao

[ Exhibit Hall I ]

Abstract
Object detection is widely used in real-world applications such as autonomous driving, yet adversarial camouflage poses a significant threat by deceiving detectors from multiple viewpoints. Existing techniques struggle to maintain consistent attack efficacy across different viewpoints. To address this, we propose GRAC, an adversarial camouflage framework that enhances attack effectiveness across viewpoints and distances. First, we identify conflicts in gradient updates across angles and introduce gradient reweighting to resolve them, enabling coordinated optimization. Second, we model light interactions to simulate illumination changes, improving robustness under varying lighting conditions. Additionally, we address non-uniform texture updates arising from inconsistent sampling density during rendering by applying pooling-based texture regularization to improve smoothness. Extensive experiments in both simulated and physical environments demonstrate that GRAC outperforms existing methods across diverse conditions.
Poster
Zexin Zheng · Jia-Feng Cai · Xiao-Ming Wu · Yilin Wei · Yu-Ming Tang · Wei-Shi Zheng · Ancong Wu

[ Exhibit Hall I ]

Abstract
The development of a generalist agent with adaptive multiple manipulation skills has been a long-standing goal in the robotics community. In this paper, we explore a crucial task, skill-incremental learning, in robotic manipulation, which is to endow robots with the ability to learn new manipulation skills based on previously learned knowledge without re-training. First, we build a skill-incremental environment based on the RLBench benchmark, and explore how traditional incremental methods perform in this setting. We find that they suffer from severe catastrophic forgetting because these methods, designed for classification, overlook the temporality and action complexity inherent in robotic manipulation tasks. Towards this end, we propose an incremental Manipulation framework, termed iManip, to mitigate the above issues. We first design a temporal replay strategy to maintain the integrity of old skills when learning a new skill. Moreover, we propose an extendable PerceiverIO, consisting of an action prompt with extendable weights to adapt to new action primitives in new skills. Extensive experiments show that our framework performs well in Skill-Incremental Learning. Code for the skill-incremental environment and our framework will be open-sourced.
Poster
Lanning Zhang · Ying Zhou · Fei Gao · Ziyun Li · Maoying Qiao · Jinlan Xu · Nannan Wang

[ Exhibit Hall I ]

Abstract
Although deep neural networks have achieved remarkable success in various computer vision tasks, they face significant challenges in degraded image understanding due to domain shifts caused by quality variations. Drawing biological inspiration from the human visual system (HVS), which dynamically adjusts perception strategies through contrast gain control and selective attention to salient regions, we propose Quality-Adaptive Normalization (Q-Norm) - a novel normalization method that learns adaptive parameters guided by image quality features. Our approach addresses two critical limitations of conventional normalization techniques: 1) Domain Covariance Shift: Existing methods fail to align feature distributions across different quality domains. Q-Norm implicitly achieves cross-domain alignment through quality-aware parameter adaptation without explicit loss functions. 2) Biological Plausibility: By mimicking HVS's contrast normalization mechanisms and attention-based feature selection, Q-Norm dynamically adjusts the mean and variance parameters using a pre-trained quality assessment model, ensuring robustness to image degradation. Extensive experiments across multiple tasks (image classification, semantic segmentation, object detection) demonstrate that Q-Norm consistently outperforms baseline methods on low-quality images. Code will be made available after peer review.
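The core mechanism reads as a normalization layer whose scale and shift are predicted from a quality feature rather than learned as fixed parameters. The module below is a minimal sketch under that assumption; the instance-style statistics and the linear heads are illustrative choices, and the pretrained quality-assessment backbone is assumed to be supplied externally.

```python
import torch
import torch.nn as nn

class QNorm(nn.Module):
    """Minimal sketch of quality-adaptive normalization: standard feature
    normalization whose scale and shift are predicted from an image-quality
    feature (e.g., from a frozen IQA model). Layer shapes are assumptions."""
    def __init__(self, channels: int, quality_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.to_gamma = nn.Linear(quality_dim, channels)
        self.to_beta = nn.Linear(quality_dim, channels)

    def forward(self, x: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # instance-style normalization over spatial dimensions
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        # quality-conditioned affine parameters
        gamma = self.to_gamma(q_feat)[:, :, None, None]
        beta = self.to_beta(q_feat)[:, :, None, None]
        return gamma * x_hat + beta

y = QNorm(64, 128)(torch.randn(2, 64, 16, 16), torch.randn(2, 128))
```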
Poster
Yanwen Fang · Wenqi Jia · Xu Cao · Peng-Tao Jiang · Guodong Li · Jintai CHEN

[ Exhibit Hall I ]

Abstract
Multi-person motion prediction becomes particularly challenging when handling highly interactive scenarios involving extreme motions. Previous works focused more on the case of `moderate' motions (e.g., walking together), where predicting each pose in isolation often yields reasonable results. However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. PGformer incorporates a novel cross-query attention module to learn bidirectional dependencies between pose sequences and a proxy unit that subtly controls bidirectional spatial information flow. We evaluate PGformer on the challenging ExPI dataset, which involves large collaborative movements. Both quantitative and qualitative results demonstrate the superiority of PGformer in both short- and long-term predictions. We also test the proposed method on the moderate-movement datasets CMU-Mocap and MuPoTS-3D, generalizing PGformer to scenarios with more than two individuals with promising results.
Poster
Yiwen Chen · Yikai Wang · Yihao Luo · Zhengyi Wang · Zilong Chen · Jun Zhu · Chi Zhang · Guosheng Lin

[ Exhibit Hall I ]

Abstract
Meshes are the de facto 3D representation in the industry but are labor-intensive to produce. Recently, a line of research has focused on autoregressively generating meshes. This approach processes meshes into a sequence composed of vertices and then generates them vertex by vertex, similar to how a language model generates text. These methods have achieved some success but still struggle to generate complex meshes. One primary reason for this limitation is their inefficient tokenization methods. To address this issue, we introduce MeshAnything V2, an advanced mesh generation model designed to create Artist-Created Meshes that align precisely with specified shapes. A key innovation behind MeshAnything V2 is our novel Adjacent Mesh Tokenization (AMT) method. Unlike traditional approaches that represent each face using three vertices, AMT optimizes this by employing a single vertex wherever feasible, effectively reducing the token sequence length by about half on average. This not only streamlines the tokenization process but also results in more compact and well-structured sequences, enhancing the efficiency of mesh generation. With these improvements, MeshAnything V2 effectively doubles the face limit compared to previous models, delivering superior performance without increasing computational costs. Our extensive experiments across various mesh tokenization methods demonstrate that AMT is pivotal …
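The token-saving idea behind AMT can be conveyed with a heavily simplified sketch: when a face shares an edge with the previously emitted face, only its one new vertex needs to be emitted. The function below keeps the input face order and skips the reordering and separator tokens of the real method, so it should be read as an illustration of the principle rather than the actual tokenizer.

```python
def adjacent_mesh_tokenize(faces):
    """Simplified sketch of the Adjacent Mesh Tokenization idea: when a face
    shares an edge (two vertices) with the previously emitted face, emit only
    its new vertex instead of all three. Real AMT also reorders faces to
    maximize adjacency and inserts explicit separator tokens; this sketch
    keeps the input order for clarity."""
    tokens, prev = [], None
    for face in faces:                        # each face: tuple of 3 vertex ids
        if prev is not None and len(set(face) & set(prev)) == 2:
            (new_v,) = set(face) - set(prev)  # single vertex continues the strip
            tokens.append(new_v)
        else:
            tokens.extend(face)               # start a new strip with all 3 vertices
        prev = face
    return tokens

faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (7, 8, 9)]
print(adjacent_mesh_tokenize(faces))  # [0, 1, 2, 3, 4, 7, 8, 9]
```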
Poster
Prerit Gupta · Jason Alexander Fotso-Puepi · Zhengyuan Li · Jay Mehta · Aniket Bera

[ Exhibit Hall I ]

Abstract
We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and detailed with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance synthesis. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner.
Poster
Susan Liang · Chao Huang · Yolo Yunlong Tang · Zeliang Zhang · Chenliang Xu

[ Exhibit Hall I ]

Abstract
The Audio-Visual Acoustic Synthesis (AVAS) task aims to model realistic audio propagation behavior within a specific visual scene. Prior works often rely on sparse image representations to guide acoustic synthesis. However, we argue that this approach is insufficient to capture the intricate physical properties of the environment and may struggle with generalization across diverse scenes. In this work, we review the limitations of existing pipelines and address the research question: Can we leverage physical audio-visual associations to enhance neural acoustic synthesis? We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or $\pi$-AVAS), a novel framework designed with two key objectives. i) Generalization: We develop a vision-guided audio simulation framework that leverages physics-based sound propagation. By explicitly modeling vision-grounded geometry and sound rays, our approach achieves robust performance across diverse visual environments. ii) Realism: While simulation-based approaches offer generalizability, they often compromise on realism. To mitigate this, we incorporate a second stage for data-centric refinement, where we propose a flow matching-based audio refinement model to narrow the gap between simulation and real-world audio-visual scenes. Extensive experiments demonstrate the effectiveness and robustness of our method. We achieve state-of-the-art performance on the RWAVS-Gen, RWAVS, and RAF datasets. Additionally, we show that our approach can be …
Poster
Yuwen Pan · Rui Sun · Wangkai Li · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
Semantic segmentation under adverse conditions is crucial for ensuring robust and accurate visual perception in challenging weather conditions. The distinct characteristics of extreme scenarios hinder traditional segmentation paradigms, highlighting the necessity for approaches tailored to adverse weather. Due to the scarcity of labeled data in such scenarios, the unsupervised domain adaptation paradigm is commonly utilized to leverage knowledge from normal weather conditions. Although existing methods strive to absorb information from labeled normal weather data and unlabeled adverse condition images, they face significant challenges due to weather unawareness and severe feature heterogeneity, and thus struggle to effectively parse scenes under adverse conditions. In this paper, we propose a novel weather-aware aggregation and adaptation network that leverages characteristic knowledge to achieve weather homogenization and enhance scene perception. Specifically, we introduce amplitude prompt aggregation to capture essential characteristics from the Fourier frequency domain that are indicative of different weather conditions. Additionally, we employ weather heterogeneity adaptation to mitigate inter-domain heterogeneity, thereby achieving feature homogenization across diverse environments. Extensive experimental results on multiple challenging benchmarks demonstrate that our method achieves consistent improvements for semantic segmentation under adverse conditions.
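Since the amplitude spectrum is what carries the weather-indicative statistics here, a compact way to picture the aggregation step is a pooled log-amplitude descriptor computed per image. The function below is a minimal sketch of that idea; the pooling size and log compression are assumptions, and the learned prompt aggregation itself is not shown.

```python
import torch

def fourier_amplitude_descriptor(images: torch.Tensor, pooled: int = 8) -> torch.Tensor:
    """Minimal sketch: extract a weather-indicative descriptor from the
    Fourier amplitude spectrum of an image batch (B, C, H, W). Pooling the
    log-amplitude to a small grid is an assumed design choice."""
    spec = torch.fft.fft2(images, norm="ortho")
    amplitude = torch.fft.fftshift(spec.abs(), dim=(-2, -1))   # low frequencies at center
    log_amp = torch.log1p(amplitude)
    desc = torch.nn.functional.adaptive_avg_pool2d(log_amp, pooled)
    return desc.flatten(1)                                     # (B, C * pooled * pooled)

d = fourier_amplitude_descriptor(torch.rand(2, 3, 128, 128))
print(d.shape)  # torch.Size([2, 192])
```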
Poster
Lanmiao Liu · Esam Ghaleb · asli ozyurek · Zerrin Yumak

[ Exhibit Hall I ]

Abstract
Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module is applied to automatically generate gestures from speech, text-based semantics and speaker identity that ensures consistency between the semantic relevance of generated gestures and co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model can be viewed at https://semgesture.github.io. Our code, dataset and pre-trained models will be shared upon acceptance.
Poster
Thomas Dagès · Michael Lindenbaum · Alfred Bruckstein

[ Exhibit Hall I ]

Abstract
Standard convolutions are prevalent in image processing and deep learning, but their fixed kernels limit adaptability. Several deformation strategies of the reference kernel grid have been proposed, yet they lack a unified theoretical framework. By returning to a metric perspective for images, now seen as two-dimensional manifolds equipped with notions of local and geodesic distances, either symmetric (Riemannian) or not (Finsler), we provide a unifying principle: the kernel positions are samples of unit balls of implicit metrics. With this new perspective, we also propose metric convolutions, a novel approach that samples unit balls from explicit signal-dependent metrics, providing interpretable operators with geometric regularisation. This framework, compatible with gradient-based optimisation, can directly replace existing convolutions applied to either input images or deep features of neural networks. Metric convolutions typically require fewer parameters and provide better generalisation. Our approach shows competitive performance in standard denoising and classification tasks.
Poster
Baoli Sun · Ning Wang · Xinzhu Ma · Anqi Zou · Lu Yihang · Chuixuan Fan · Zhihui Wang · Kun Lu · Zhiyong Wang

[ Exhibit Hall I ]

Abstract
Understanding the behaviors of robotic arms is essential for various robotic applications such as logistics management, precision agriculture, and automated manufacturing. However, the lack of large-scale and diverse datasets significantly hinders progress in video-based robotic arm action understanding, highlighting the need for collecting a new large-scale dataset. In particular, our RobAVA contains ~40k video sequences with video-level fine-grained annotations, covering basic actions such as picking, pushing, and placing, as well as their combinations in different orders and interactions with various objects. In contrast to existing action recognition benchmarks, RobAVA includes instances of both normal and anomalous executions for each action category. Our further analysis reveals that the primary challenge in robotic arm action recognition lies in the fact that a complete action consists of a sequence of fundamental, atomic behaviors, requiring models to learn the inter-relationships among them. To this end, we propose a novel baseline approach, AGPT-Net, which re-defines the problem of understanding robotic arm actions as a task of aligning video sequences with atomic attributes. To enhance AGPT-Net's ability to distinguish normal and anomalous action instances, we introduce a joint semantic space constraint between category and attribute semantics, thereby amplifying the separation between normal and anomalous attribute representations for each …
Poster
Sunpill Kim · Seunghun Paik · Chanwoo Hwang · Dongsoo Kim · Junbum Shin · Jae Hong Seo

[ Exhibit Hall I ]

Abstract
As face recognition systems (FRS) become more widely used, user privacy becomes more important. A key privacy issue in FRS is to protect the user’s face template, since the characteristics of the user’s face image can be recovered from the template. Although recent advances in cryptographic tools such as homomorphic encryption (HE) have provided opportunities for securing the FRS, HE cannot be used directly with FRS in an efficient plug-and-play manner. In particular, although HE is functionally complete for arbitrary programs, it is basically designed for algebraic operations on encrypted data of predetermined shape such as a polynomial ring. Thus, a non-tailored combination of HE and the system can yield very inefficient performance, and many previous HE-based face template protection methods are hundreds of times slower than plain systems without protection. In this study, we propose $\mathsf{IDFace}$, a new HE-based secure and efficient face identification method with template protection. The $\mathsf{IDFace}$ is designed on the basis of two novel techniques for efficient searching on a (homomorphically encrypted) biometric database with an angular metric. The first technique is a template representation transformation that sharply reduces the unit cost for the matching test. The second is a space-efficient encoding that reduces wasted …
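The matching test that the encrypted pipeline accelerates is ordinary identification under an angular (cosine) metric. The sketch below shows that plaintext logic only, as a reference point; the homomorphic packing, encoding, and encrypted comparison that constitute IDFace's actual contribution are not represented here, and the threshold is a placeholder.

```python
import torch

def angular_identification(query: torch.Tensor, gallery: torch.Tensor, thresh: float = 0.4):
    """Plaintext sketch of the matching test that IDFace accelerates under
    homomorphic encryption: identification on an angular (cosine) metric.
    The encrypted packing/encoding itself is not shown here."""
    q = torch.nn.functional.normalize(query, dim=-1)       # (D,)
    g = torch.nn.functional.normalize(gallery, dim=-1)     # (N, D)
    scores = g @ q                                         # cosine similarities
    best = int(scores.argmax())
    if scores[best] >= thresh:
        return best, float(scores[best])
    return None, float(scores.max())

gallery = torch.randn(1000, 512)
idx, score = angular_identification(gallery[42] + 0.05 * torch.randn(512), gallery)
print(idx, round(score, 3))   # expected to recover index 42 with high similarity
```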
Poster
Ziwei Wang · Sameera Ramasinghe · Chenchen Xu · Julien Monteil · Loris Bazzani · Thalaiyasingam Ajanthan

[ Exhibit Hall I ]

Abstract
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
Poster
Junzhe Lu · Jing Lin · Hongkun Dou · Ailing Zeng · Yue Deng · Xian Liu · Zhongang Cai · Lei Yang · YULUN ZHANG · Haoqian Wang · Ziwei Liu

[ Exhibit Hall I ]

Abstract
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
Poster
Zhongze Wang · Haitao Zhao · Lujian Yao · Jingchao Peng · Kaijie Zhao

[ Exhibit Hall I ]

Abstract
Images captured under severe weather conditions often suffer from complex, composite degradations, varying in intensity. In this paper, we introduce a novel method, Dual-Level Prototype Learning (DPL), to tackle the challenging task of composite degraded image restoration. Unlike previous methods that rely on fixed embeddings to characterize degradation types, DPL maintains a number of degradation-level prototypes to represent the specific degradation scenes dynamically. Furthermore, considering the diverse factors influencing each degradation type, factor-level prototypes are incorporated to capture variations in individual degradation factors. Image features are matched with both degradation-level and factor-level prototypes, producing detailed scene embeddings that enhance the network's understanding of composite degradations. These scene embeddings are then processed through Dual Scene Embedding Transformer Blocks to guide the restoration process. To further refine the prototype distribution, we propose a Prototype Scatter Learning Loss, which enables prototypes within the same degradation to learn more information and push prototypes between different degradations to be separate. Additionally, we introduce a new dataset named Variable Composite Degradation (VCD) dataset which contains images with different intensities of each type of composite degradation to validate the efficacy of our method. Extensive experiments demonstrate that DPL significantly outperforms existing methods in restoring images with composite …
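Prototype matching at the two levels can be sketched as soft assignment of an image feature against two learnable banks, with the resulting weighted prototypes concatenated into a scene embedding. The module below is a minimal sketch under assumed bank sizes and a simple concatenation; it is not the paper's DPL implementation.

```python
import torch
import torch.nn as nn

class DualLevelPrototypes(nn.Module):
    """Minimal sketch: match image features against degradation-level and
    factor-level prototype banks and build a scene embedding from the soft
    assignments. Bank sizes and the concatenation are assumptions."""
    def __init__(self, dim: int, num_degradations: int = 4, factors_per_deg: int = 3):
        super().__init__()
        self.deg_protos = nn.Parameter(torch.randn(num_degradations, dim))
        self.factor_protos = nn.Parameter(torch.randn(num_degradations * factors_per_deg, dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:   # feat: (B, D)
        f = nn.functional.normalize(feat, dim=-1)
        w_deg = torch.softmax(f @ nn.functional.normalize(self.deg_protos, dim=-1).t(), dim=-1)
        w_fac = torch.softmax(f @ nn.functional.normalize(self.factor_protos, dim=-1).t(), dim=-1)
        # weighted prototypes at both levels form the scene embedding
        scene = torch.cat([w_deg @ self.deg_protos, w_fac @ self.factor_protos], dim=-1)
        return scene                                          # (B, 2D)

emb = DualLevelPrototypes(dim=256)(torch.randn(2, 256))
```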
Poster
zhiliang wu · Kerui Chen · Kun Li · Hehe Fan · Yi Yang

[ Exhibit Hall I ]

Abstract
Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the “how to inpaint”. This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate “where to inpaint”. However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we aim to relax this assumption by defining a new blind video inpainting setting, enabling the networks to learn the mapping from corrupted video to inpainted result directly, eliminating the need for corrupted region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both “where to inpaint” and “how to inpaint” simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantic-discontinuous regions of the frame and utilizing the temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. Besides, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion …
Poster
Chang Liu · Yunfan Ye · Fan Zhang · Qingyang Zhou · Yuchuan Luo · Zhiping Cai

[ Exhibit Hall I ]

Abstract
Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomalies. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representations by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
Poster
Yatian Pang · Bin Zhu · Bin Lin · Mingzhe Zheng · Francis Tay · Ser-Nam Lim · Harry Yang · Li Yuan

[ Exhibit Hall I ]

Abstract
In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle with generating coherent, high-quality content in an efficient and user-friendly manner. Concretely, baseline methods relying on only 2D pose guidance lack the cues of 3D information like depth and normal maps, leading to suboptimal results. Other works introduce extra representations to provide additional 3D information but inevitably involve a cumbersome and time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses by introducing an efficient diffusion model, enabling high-quality human image animation with various guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations could enrich the guidance signals, facilitating intra-frame coherency and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human …
Poster
Wanquan Feng · Tianhao Qi · Jiawei Liu · Mingzhen Sun · Pengqi Tu · Tianxiang Ma · Fei Dai · Songtao Zhao · SiYu Zhou · Qian HE

[ Exhibit Hall I ]

Abstract
Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Please see the video results in our anonymous github repository: https://github.com/iccv2025sub592/sub592.
Poster
Shuangkang Fang · I-Chao Shen · Yufeng Wang · Yi-Hsuan Tsai · Yi Yang · Shuchang Zhou · Wenrui Ding · Takeo Igarashi · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs' token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50x larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs' ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.
Poster
Yiming Wu · Huan Wang · Zhenghao Chen · Jianxin Pang · Dong Xu

[ Exhibit Hall I ]

Abstract
Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and extensive memory footprint. In this paper, we propose \textbf{LightDP}, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis on existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline, optimizing the model's post-pruning recoverability explicitly. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on three standard datasets, i.e., Push-T, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments.
Poster
Jaejun Hwang · Dayoung Gong · Manjin Kim · Minsu Cho

[ Exhibit Hall I ]

Abstract
Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries being conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, TAPOS and Kinetics-GEBD, generating diverse and plausible event boundaries.
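The diversity control mentioned above relies on the standard classifier-free guidance combination of conditional and unconditional denoiser outputs, which can be stated in a few lines; the guidance scale trades diversity (low values) against fidelity to the video condition (high values). The function below shows that standard formula, not anything specific to DiffGEBD.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance combination: a scale of 0 keeps the
    unconditional (most diverse) estimate, larger scales pull predictions
    toward the video-conditioned estimate."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy usage with two noise predictions from the same denoiser
e_c, e_u = torch.randn(1, 100), torch.randn(1, 100)
for s in (0.0, 1.0, 3.0):
    print(s, classifier_free_guidance(e_c, e_u, s).norm().item())
```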
Poster
Junyu Lou · Xiaorui Zhao · Kexuan Shi · Shuhang Gu

[ Exhibit Hall I ]

Abstract
Deep learning-based bilateral grid processing has emerged as a promising solution for image enhancement, inherently encoding spatial and intensity information while enabling efficient full-resolution processing through slicing operations. However, existing approaches are limited to linear affine transformations, hindering their ability to model complex color relationships. Meanwhile, while multi-layer perceptrons (MLPs) excel at non-linear mappings, traditional MLP-based methods employ globally shared parameters, which makes it hard to handle localized variations. To overcome these dual challenges, we propose a Bilateral Grid-based Pixel-Adaptive Multi-layer Perceptron (BPAM) framework. Our approach synergizes the spatial modeling of bilateral grids with the non-linear capabilities of MLPs. Specifically, we generate bilateral grids containing MLP parameters, where each pixel dynamically retrieves its unique transformation parameters and obtains a distinct MLP for color mapping based on spatial coordinates and intensity values. In addition, we propose a novel grid decomposition strategy that categorizes MLP parameters into distinct types stored in separate subgrids. Multi-channel guidance maps are used to extract category-specific parameters from corresponding subgrids, ensuring effective utilization of color information during slicing while guiding precise parameter generation. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods while maintaining real-time processing capabilities.
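The slicing operation at the heart of the bilateral-grid design can be sketched as a per-pixel lookup into a low-resolution grid indexed by spatial position and guidance intensity. The function below is a minimal sketch of such a lookup using trilinear sampling; the grid layout and the way the retrieved vector would parameterize a per-pixel MLP are assumptions, not the BPAM implementation.

```python
import torch
import torch.nn.functional as F

def slice_pixel_params(param_grid: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of bilateral-grid slicing: every pixel looks up its own
    parameter vector from a low-resolution grid indexed by (x, y, intensity).
    param_grid: (B, P, D, Gh, Gw) with P parameters per cell and D intensity bins;
    guide:      (B, 1, H, W) guidance/intensity map in [0, 1]."""
    b, _, h, w = guide.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    xy = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)         # normalized spatial coords
    z = guide.permute(0, 2, 3, 1) * 2 - 1                         # intensity -> [-1, 1]
    coords = torch.cat([xy, z], dim=-1).unsqueeze(1)              # (B, 1, H, W, 3)
    sliced = F.grid_sample(param_grid, coords, align_corners=True)  # (B, P, 1, H, W)
    return sliced.squeeze(2)                                      # per-pixel parameters (B, P, H, W)

params = slice_pixel_params(torch.randn(1, 10, 8, 16, 16), torch.rand(1, 1, 64, 64))
print(params.shape)  # torch.Size([1, 10, 64, 64])
```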
Poster
Zijia Lu · Ehsan Elhamifar

[ Exhibit Hall I ]

Abstract
Procedural videos are critical for learning new tasks. Temporal action segmentation (TAS), which classifies the action in every video frame, has become essential for understanding procedural videos. Existing TAS models, however, are limited to a fixed set of tasks learned at training time and unable to adapt to novel tasks at test time. Thus, we introduce the new problem of Multi-Modal Few-shot Temporal Action Segmentation (MMF-TAS) to learn models that can generalize to novel procedural tasks with minimal visual/textual examples. We propose the first MMF-TAS framework by designing a Prototype Graph Network (PGNet). PGNet contains a Prototype Building Block that summarizes action information from support videos of the novel tasks via an Action Relation Graph, and encodes this information into action prototypes via a Dynamic Graph Transformer. Next, it employs a Matching Block that compares action prototypes with query videos to infer framewise action labels. To exploit the advantages of both visual and textual modalities, we compute separate action prototypes for each modality and combine the two modalities by a prediction fusion method to avoid overfitting on one modality. Through extensive experiments on procedural datasets, we show that our method successfully adapts to novel tasks during inference and significantly outperforms baselines.
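The prototype-then-match structure described above can be illustrated with a deliberately simplified sketch: per-action prototypes averaged from support-video frame features, and query frames labeled by cosine similarity to those prototypes. Feature dimensions and the averaging/matching rules are assumptions; PGNet's graph-based prototype building and multimodal fusion are not reproduced here.

```python
# Hedged sketch of prototype-based framewise labeling for few-shot segmentation.
import torch
import torch.nn.functional as F

def build_prototypes(support_feats, support_labels, n_actions):
    # support_feats: (N_frames, D); support_labels: (N_frames,) in [0, n_actions)
    protos = [support_feats[support_labels == a].mean(dim=0) for a in range(n_actions)]
    return torch.stack(protos)                       # (n_actions, D)

def label_query(query_feats, prototypes):
    sim = F.cosine_similarity(query_feats[:, None], prototypes[None], dim=-1)
    return sim.argmax(dim=-1)                        # framewise action labels

support = torch.randn(200, 256)                      # support-video frame features
labels = torch.randint(0, 5, (200,))                 # framewise support labels
protos = build_prototypes(support, labels, n_actions=5)
pred = label_query(torch.randn(300, 256), protos)    # (300,) labels for query frames
print(pred.shape)
```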
Poster
Wenjia Wang · Liang Pan · Zhiyang Dou · Jidong Mei · Zhouyingcheng Liao · Yifan Wu · Yuke Lou · Jingbo Wang · Lei Yang · Taku Komura

[ Exhibit Hall I ]

Abstract
Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short in achieving both diverse style and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues, simultaneously perceiving environmental geometries and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotions and interactions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.
Poster
Sanjoy Kundu · Shanmukha Vellamcheti · Sathyanarayanan Aakur

[ Exhibit Hall I ]

Abstract
Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels efficiently while avoiding exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0–L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition. Code (in supplementary) will be shared publicly after review.
Poster
Jianlong Jin · Chenglong Zhao · Ruixin Zhang · Sheng Shang · Yang Zhao · Jun Wang · Jingyun Zhang · Shouhong Ding · Wei Jia · Yunsheng Wu

[ Exhibit Hall I ]

Abstract
Current palmprint recognition models achieve strong performance on constrained datasets, yet exhibit significant limitations in handling challenging palmprint samples with geometric distortions and textural degradations. Data augmentation is widely adopted to improve model generalization. However, existing augmentation methods struggle to generate palmprint-specific variations while preserving identity consistency, leading to suboptimal performance. To address these problems, we propose a unified adversarial augmentation framework. It first utilizes an adversarial training paradigm for palmprint recognition, optimizing for challenging augmented samples by incorporating feedback from the recognition network. We enhance palmprint images with both geometric and textural variations. Specifically, it adopts a spatial transformation module and a new identity-preserving module, which synthesizes palmprints with diverse textural variations while maintaining consistent identity. For more effective adversarial augmentation, a dynamic sampling strategy is proposed. Extensive experiments demonstrate the superior performance of our method on both challenging and constrained palmprint datasets. Our code will be released.
Poster
hongjun wang · Jiyuan Chen · Zhengwei Yin · Xuan Song · Yinqiang Zheng

[ Exhibit Hall I ]

Abstract
Generalizable Image Super-Resolution aims to enhance model generalization capabilities under unknown degradations. To achieve this goal, the models are expected to focus only on image content-related features instead of degradation details (i.e., overfitting to degradations). Recently, numerous approaches such as dropout and feature alignment have been proposed to suppress models' natural tendency to overfit degradations, and they yield promising results. Nevertheless, these works have assumed that models overfit to all degradation types (e.g., blur, noise), while through careful investigations in this paper, we discover that models predominantly overfit to noise, largely attributable to the distinct degradation pattern of noise compared to other degradation types. In this paper, we propose a targeted feature denoising framework, comprising noise detection and denoising modules. Our approach represents a general solution that can be seamlessly integrated with existing super-resolution models without requiring architectural modifications. Our framework demonstrates superior performance compared to previous regularization-based methods across five traditional benchmark datasets, encompassing both synthetic and real-world scenarios.
Poster
Liam Schoneveld · Zhe Chen · Davide Davoli · Jiapeng Tang · Saimon Terazawa · Ko Nishino · Matthias Nießner

[ Exhibit Hall I ]

Abstract
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming the state of the art in emotion classification.
Poster
Yichen Li · Antonio Torralba

[ Exhibit Hall I ]

Abstract
General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce the senses of proprioception, kinesthesia, force haptics, and muscle activation to capture such precise control. This comprehensive set of multimodal senses naturally enables fine-grained interactions that are difficult to simulate with unimodal or text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further regularize action trajectory features to enhance causality for representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
Poster
Lingteng Qiu · Xiaodong Gu · Peihao Li · Qi Zuo · Weichao Shen · Junfei Zhang · Kejie Qiu · Weihao Yuan · Guanying Chen · Zilong Dong · Liefeng Bo

[ Exhibit Hall I ]

Abstract
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
Poster
Chengxu Liu · Lu Qi · Jinshan Pan · Xueming Qian · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, most existing approaches only use adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blurry patterns. In this paper, we propose a novel diffusion model (DM)-based framework, dubbed TP-Diff, for image deblurring by learning spatially varying texture priors from unpaired sharp data. In particular, TP-Diff employs a DM to generate the prior knowledge used to recover the texture of blurry images. To implement it, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to encode the texture prior and thereby provide supervision for the DM training. To fully exploit the generated texture priors, we further present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. In addition, a wavelet-based adversarial loss is used to preserve high-frequency texture details. Extensive evaluations demonstrate that TP-Diff provides a promising unsupervised deblurring solution and outperforms SOTA methods on six widely used benchmarks.
Poster
Dongbin Zhang · Yunfei Liu · Lijian Lin · Ye Zhu · Yang Li · Minghan Qin · Yu Li · Haoqian Wang

[ Exhibit Hall I ]

Abstract
Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX's expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (~0.1 s), and supports real-time animation and rendering.
Poster
Ziyi Wang · Peiming Li · Hong Liu · Zhichao Deng · Can Wang · Jun Liu · Junsong Yuan · Mengyuan Liu

[ Exhibit Hall I ]

Abstract
Natural Human-Robot Interaction (N-HRI) requires a robot to recognize human actions at varying distances while accounting for disturbing motions from either the human or the robot. However, existing human action datasets are primarily designed for conventional Human-Robot Interaction (HRI) and fail to meet the unique requirements of N-HRI due to limited data, data modalities, task categories, and diversity in subjects and environments. To address this, we introduce ACTIVE, a large-scale human action dataset focused on ACtions from RoboTIc ViEw. Our dataset includes 30 action categories, 80 participants and 46,868 video instances, encompassing both point cloud and RGB modalities. During data capture, participants perform a range of human actions in diverse environments at varying distances (from 3m to 50m), while also executing disturbing motions, and with the robot itself in different states of motion. To recognize actions from a robotic view, we propose ACTIVE-PC, a Point Cloud-based method for ACTIVE dataset, which is able to recognize human actions at long distances using our proposed Multilevel Neighborhood Sampling, Layered Recognizers, and Elastic Ellipse Query, along with precise decoupling of kinematic interference and human actions. Experimental results verify the effectiveness of our method. Our project page is https://active2750.github.io/.
Poster
Tri Ton · Ji Woo Hong · Chang Yoo

[ Exhibit Hall I ]

Abstract
This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving relative reductions of 53\% in Fréchet Distance (FD) and 29\% in Fréchet Audio Distance (FAD), along with 97.19\% alignment accuracy, highlighting its superior audio quality and synchronization precision.
Poster
Dayong Su · Yafei Zhang · Huafeng Li · Jinxing Li · Yu Liu

[ Exhibit Hall I ]

Abstract
Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration \& Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN's adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches.
Poster
Jiawei He · Danshi Li · Xinqiang Yu · Zekun Qi · Wenyao Zhang · Jiayi Chen · Zhaoxiang Zhang · Zhizheng Zhang · Li Yi · He Wang

[ Exhibit Hall I ]

Abstract
As large models begin to gain momentum, vision-language foundation models are enabling robots to generalizably perform more and more tasks. However, due to the difficulty of data collection, the benefits remain limited to simple embodiments. In this paper, we present \textbf{DexVLG}, a vision-language model that predicts language-instruction-aligned dexterous grasp poses given single-view RGBD perception. To achieve this, we first synthesize a dataset of 170M dexterous grasp poses aligned with semantic parts on 174k objects in simulation, paired with informative part-level captions. With this large-scale dataset, named \textbf{DexGraspNet 3.0}, we train a flow-matching VLM to generate instruction-aligned grasp poses on tabletop objects. To evaluate DexVLG, we curate benchmarks in physics-based simulation and perform real-world experiments. Our extensive experiments demonstrate DexVLG's strong zero-shot generalizability, achieving over 76\% zero-shot execution success rate and state-of-the-art part grasp accuracy in simulation, and demonstrate successful part-aligned grasps on real-world objects.
Poster
Yifan Zhan · Qingtian Zhu · Muyao Niu · Mingze Ma · Jiancheng Zhao · Zhihang Zhong · Xiao Sun · Yu Qiao · Yinqiang Zheng

[ Exhibit Hall I ]

Abstract
In this paper, we highlight a critical yet often overlooked factor in most 3D human tasks, namely modeling complicated 3D humans with hand-held objects or loose-fitting clothing. It is known that the parameterized formulation of SMPL is able to fit human skin, while hand-held objects and loose-fitting clothing are difficult to model within this unified framework, since their movements are usually decoupled from the human body. To enhance the capability of the SMPL skeleton in response to this situation, we propose a growth strategy that enables the joint tree of the skeleton to expand adaptively. Specifically, our method, called ToMiE, consists of parent joints localization and external joints optimization. For parent joints localization, we employ a gradient-based approach guided by both LBS blending weights and motion kernels. Once the external joints are obtained, we proceed to optimize their transformations in SE(3) across different frames, enabling rendering and explicit animation. ToMiE manages to outperform other methods across various cases with hand-held objects and loose-fitting clothing, not only in rendering quality but also by offering free animation of grown joints, thereby enhancing the expressive ability of the SMPL skeleton for a broader range of applications.
Poster
Aggelina Chatziagapi · Louis-Philippe Morency · Hongyu Gong · Michael Zollhöfer · Dimitris Samaras · Alexander Richard

[ Exhibit Hall I ]

Abstract
We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars.
Poster
Xu Yang · Shaoli Huang · Shenbo Xie · Xuelin Chen · Yifei Liu · Changxing Ding

[ Exhibit Hall I ]

Abstract
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405—the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts. Code, models, and CSG-405 will be publicly released.
Poster
Yuan Sun · Xuan Wang · Cong Wang · WeiLi Zhang · Yanbo Fan · Yu Guo · Fei Wang

[ Exhibit Hall I ]

Abstract
Recently, 3D head avatar modeling based on 3D Gaussians has demonstrated significant advantages in rendering quality and efficiency, provided there is sufficient data. Some efforts have begun to train prior models on large datasets to develop generalizable 3D Gaussian head avatar modeling methods. Unfortunately, due to the limited expressive power of identity-shared 3D representations, prior-based modeling often results in degraded rendering quality. To overcome this limitation, we propose to formulate 3D Gaussian head avatar modeling as a joint reconstruction and registration problem. Given static input images (e.g., a short mobile phone capture), we optimize two sets of 3D Gaussians: the prior-based one possesses complete animation rigging information inferred from the prior model and produces plausible modeling results, while the prior-free one is used to more freely capture the fine-grained geometric and texture details in the input images. Additionally, we simultaneously solve the registration problem between the two 3D Gaussian sets. On one hand, the registration results provide binding information for the prior-free reconstruction to make it animatable. On the other hand, during optimization, the prior-based Gaussians can regularize the prior-free reconstruction to resist overfitting and perform well on novel expressions. Finally, we merge the parts of the …
Poster
Ruoxi Guo · Huaijin Pi · Zehong Shen · Qing Shuai · zechenhu zechenhu · Zhumei Wang · Yajiao Dong · Ruizhen Hu · Taku Komura · Sida Peng · Xiaowei Zhou

[ Exhibit Hall I ]

Abstract
Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition and ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on well-established datasets and novel text prompts demonstrate that our method efficiently utilizes 2D data, supporting a wider range of realistic 3D human motion generation.
Poster
Sejin Park · Sangmin Lee · Kyong Hwan Jin · Seung-Won Jung

[ Exhibit Hall I ]

Abstract
Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that performs ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance the efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
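To show what "mixing interpolation functions" can look like in code, here is a hedged sketch: a small convolutional predictor outputs per-pixel weights that blend nearest, bilinear, and bicubic upsampling at an arbitrary scale. The predictor, candidate set, and scale conditioning are illustrative assumptions; IM-LUT ultimately replaces such a predictor with look-up tables.

```python
# Hedged sketch of interpolation mixing for arbitrary-scale SR.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpMixer(nn.Module):
    MODES = ("nearest", "bilinear", "bicubic")

    def __init__(self):
        super().__init__()
        # predicts one logit per interpolation function; the scale is an extra channel
        self.weight_net = nn.Conv2d(3 + 1, len(self.MODES), kernel_size=3, padding=1)

    def forward(self, lr, scale: float):
        h, w = int(lr.shape[-2] * scale), int(lr.shape[-1] * scale)
        cands = [F.interpolate(lr, size=(h, w), mode=m,
                               align_corners=False if m != "nearest" else None)
                 for m in self.MODES]
        scale_map = torch.full_like(lr[:, :1], scale)
        logits = self.weight_net(torch.cat([lr, scale_map], dim=1))
        weights = F.interpolate(logits, size=(h, w), mode="bilinear",
                                align_corners=False).softmax(dim=1)
        # per-pixel convex combination of the candidate interpolations
        return sum(weights[:, k:k + 1] * cands[k] for k in range(len(self.MODES)))

mixer = InterpMixer()
sr = mixer(torch.rand(1, 3, 32, 48), scale=2.5)
print(sr.shape)   # torch.Size([1, 3, 80, 120])
```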
Poster
Gong Meiqi · Hao Zhang · Xunpeng Yi · Linfeng Tang · Jiayi Ma

[ Exhibit Hall I ]

Abstract
Existing multi-modal fusion methods typically extend image fusion techniques directly to video fusion tasks, which discard inherent temporal information and struggle to maintain temporal consistency between video frames. To address this limitation, we propose a comprehensive method specifically designed for multi-modal video fusion, leveraging a temporally consistent framework with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with DINOv2 and VGG19 employed for distillation. This approach enables the simultaneous and targeted enhancement of both the visual and semantic representations of videos for the first time. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline by constructing a temporal cooperative module, which leverages temporal dependencies to facilitate weak information recovery. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative metrics tailored for video fusion, aimed at evaluating the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets validate the superiority of our method.
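As an illustration of what a temporal loss for fused videos could look like, the sketch below penalizes mismatch between the frame-to-frame change of the fused video and the averaged change of the two source modalities. This is only one plausible instantiation under stated assumptions, not the paper's actual temporal loss.

```python
# Hedged sketch of a simple temporal-consistency loss for multi-modal video fusion.
import torch
import torch.nn.functional as F

def temporal_consistency_loss(fused, infrared, visible):
    # all tensors: (B, T, C, H, W)
    d_fused = fused[:, 1:] - fused[:, :-1]
    d_src = 0.5 * ((infrared[:, 1:] - infrared[:, :-1]) +
                   (visible[:, 1:] - visible[:, :-1]))
    # encourage the fused video's temporal change to follow the sources'
    return F.l1_loss(d_fused, d_src)

B, T, C, H, W = 2, 8, 1, 64, 64
loss = temporal_consistency_loss(torch.rand(B, T, C, H, W),
                                 torch.rand(B, T, C, H, W),
                                 torch.rand(B, T, C, H, W))
print(loss.item())
```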
Poster
Rundong Luo · Matthew Wallingford · Ali Farhadi · Noah Snavely · Wei-Chiu Ma

[ Exhibit Hall I ]

Abstract
360$^\circ$ videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360$^\circ$ generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360$^\circ$ videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360$^\circ$ video generation. Experimental results demonstrate that our model can generate realistic and coherent 360$^\circ$ videos from in-the-wild perspective video. In addition, we showcase its potential applications, including video …
Poster
Stathis Galanakis · Alexandros Lattas · Stylianos Moschoglou · Bernhard Kainz · Stefanos Zafeiriou

[ Exhibit Hall I ]

Abstract
Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge in computer vision. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although the recent emerging large-scale diffusion models have been proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in full head synthesis, while beating current state-of-the-art multi-view diffusion models.
Poster
Linzhan Mou · Jiahui Lei · Chen Wang · Lingjie Liu · Kostas Daniilidis

[ Exhibit Hall I ]

Abstract
We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation.
Poster
Saihui Hou · Panjian Huang · Zengbin Wang · Yuan Liu · Zeyu Li · Man Zhang · Yongzhen Huang

[ Exhibit Hall I ]

Abstract
This paper addresses the challenge of animal re-identification, an emerging field that shares similarities with person re-identification but presents unique complexities due to the diverse species, environments and poses. To facilitate research in this domain, we introduce OpenAnimals, a flexible and extensible codebase designed specifically for animal re-identification. We conduct a comprehensive study by revisiting several state-of-the-art person re-identification methods, including BoT, AGW, SBS, and MGN, and evaluate their effectiveness on animal re-identification benchmarks such as HyenaID, LeopardID, SeaTurtleID, and WhaleSharkID. Our findings reveal that while some techniques generalize well, many do not, underscoring the significant differences between the two tasks. To bridge this gap, we propose ARBase, a strong Base model tailored for Animal Re-identification, which incorporates insights from extensive experiments and introduces simple yet effective animal-oriented designs. Experiments demonstrate that ARBase consistently outperforms existing baselines, achieving state-of-the-art performance across various benchmarks.
Poster
Yingjie Chen · Yifang Men · Yuan Yao · Miaomiao Cui · Liefeng Bo

[ Exhibit Hall I ]

Abstract
Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive and consistent visual changes. Then, our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed approach.
Poster
Yunhao Li · Yifan Jiao · Dan Meng · Heng Fan · Libo Zhang

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information, which is unique and essential to tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose \textbf{TRACT}, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a \textit{Trajectory Consistency Reinforcement} (\textbf{TCR}) strategy to maintain continuity across frames while tracking. Furthermore, we propose \textbf{TraCLIP}, a plug-and-play trajectory classification module. It integrates \textit{Trajectory Feature Aggregation} (\textbf{TFA}) and \textit{Trajectory Semantic Enrichment} (\textbf{TSE}) strategies to fully leverage trajectory information from visual and language perspectives, respectively. Experiments on the OV-TAO benchmark demonstrate that our approach significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT.
Poster
Junjie He · Yifeng Geng · Liefeng Bo

[ Exhibit Hall I ]

Abstract
This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image, achieving the customization of single and multiple IDs. With a carefully designed two-stage training scheme, UniPortrait achieves superior performance in both single- and multi-ID customization. Quantitative and qualitative experiments demonstrate the advantages of our method over existing approaches as well as its good scalability, e.g., the universal compatibility with existing generative control tools.
Poster
Jing Wang · Rui Zhao · Ruiqin Xiong · Xingtao Wang · Xiaopeng Fan · Tiejun Huang

[ Exhibit Hall I ]

Abstract
Open-vocabulary action recognition (OVAR) extends recognition systems to identify unseen action categories. While large-scale vision-language models (VLMs) like CLIP have enabled OVAR in image domains, their adaptation to event data remains underexplored. Event cameras offer high temporal resolution and inherent privacy preservation, making them suitable for capturing fine-grained motion dynamics. However, leveraging event data for OVAR presents challenges: 1) bridging the domain gap between static image-based models and event streams, and 2) preserving the generalization capabilities of pretrained VLMs in open-vocabulary settings. In this paper, we propose SAMPLE, a lightweight adaptation of VLMs for event-based action recognition, balancing supervised and open-vocabulary performance. We introduce a \textit{Temporal-Adaptive Multimodal Prompt Learning} strategy that consists of: 1) Unimodal prompts on both the event and text branches to learn the data distribution; 2) an Event-Text cross-modal prompt for representation space alignment; and 3) a Temporal-Adaptive prompt to model temporal dependencies across event data. Extensive evaluations demonstrate that SAMPLE outperforms prior methods across fully supervised, few-shot, base-to-novel and zero-shot settings. Notably, in zero-shot scenarios, SAMPLE achieves gains of +15.46%, +29.76%, and +23.79% on SeAct, DVS128Gesture, and PAF respectively, with lower compute cost. Our codes are included in the supplementary materials. The codes and models will be …
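A minimal sketch of the unimodal prompt idea on the event branch: a few learnable prompt tokens are prepended to the token sequence of a frozen encoder, so only the prompts are trained. The encoder, token shapes, and prompt length are illustrative assumptions rather than SAMPLE's actual modules.

```python
# Hedged sketch of unimodal prompt tuning on a frozen token encoder.
import torch
import torch.nn as nn

class PromptedEventEncoder(nn.Module):
    def __init__(self, dim=512, n_prompts=8, n_layers=2):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.encoder.parameters():   # keep the backbone frozen;
            p.requires_grad = False           # only the prompt tokens are trained

    def forward(self, event_tokens):          # event_tokens: (B, N, dim)
        b = event_tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, event_tokens], dim=1)
        return self.encoder(x)[:, self.prompts.shape[0]:]   # drop prompt positions

enc = PromptedEventEncoder()
out = enc(torch.randn(2, 49, 512))
print(out.shape)   # torch.Size([2, 49, 512])
```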
Poster
LI XIAOJIE · Ronghui Li · Shukai Fang · Shuzhao Xie · Xiaoyang Guo · Jiaqing Zhou · Junkun Peng · Zhi Wang

[ Exhibit Hall I ]

Abstract
Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences. Additional resources are available on our project: https://anonymous.4open.science/w/SoulDance-BBD3/
Poster
Guanyi Qin · Ziyue Wang · Daiyun Shen · Haofeng Liu · Hantao Zhou · Junde Wu · Runze Hu · Yueming Jin

[ Exhibit Hall I ]

Abstract
Given an object mask, Semi-supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory-based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real-time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object-level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state-of-the-art methods, i.e., achieving the F values of 91.6 (vs. 89.7 on DAVIS-17 validation set) and G values of 86.6 (vs. 86.2 on YouTubeVOS 2019 validation set) while maintaining a competitive speed of 48 …
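To give a concrete flavor of fusing rough Canny edge priors with stored object features, as described above, here is a hedged sketch that resizes an edge map to the feature resolution, fuses it with a small convolution, and uses the result to emphasize boundary-aligned features. The feature shapes, fusion layer, and scaling rule are illustrative placeholders, not the OASIS implementation.

```python
# Hedged sketch: Canny edge prior fused with object features to highlight boundaries.
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

frame = (np.random.rand(480, 854, 3) * 255).astype(np.uint8)   # stand-in video frame
edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 100, 200)

feat = torch.randn(1, 64, 60, 107)                 # stored object features (illustrative)
edge_t = torch.from_numpy(edges).float()[None, None] / 255.0
edge_t = F.interpolate(edge_t, size=feat.shape[-2:], mode="bilinear", align_corners=False)

fuse = nn.Conv2d(64 + 1, 64, kernel_size=3, padding=1)
structure_map = torch.sigmoid(fuse(torch.cat([feat, edge_t], dim=1)))
refined = feat * (1.0 + structure_map)             # emphasize boundary-aligned responses
print(refined.shape)                               # torch.Size([1, 64, 60, 107])
```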
Poster
Haotian Dong · Xin WANG · Di Lin · Yipeng Wu · Qin Chen · Ruonan Liu · Kairui Yang · Ping Li · Qing Guo

[ Exhibit Hall I ]

Abstract
High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistencies remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during video generation. In this paper, we propose the ***NoiseController***, consisting of **Multi-Level Noise Decomposition**, **Multi-Frame Noise Collaboration**, and **Joint Denoising**, to enhance spatiotemporal consistencies in video generation. In multi-level noise decomposition, we first decompose initial noises into scene-level foreground/background noises, capturing distinct motion properties to model multi-view foreground/background variations. Furthermore, each scene-level noise is further decomposed into individual-level shared and residual components. The shared noise preserves consistency, while the residual component maintains diversity. In multi-frame noise collaboration, we introduce an inter-view spatiotemporal collaboration matrix and an intra-view impact collaboration matrix, which captures mutual cross-view effects and historical cross-frame impacts to enhance video quality. The joint denoising contains two parallel denoising U-Nets to remove each scene-level noise, mutually enhancing video generation. We evaluate our ***NoiseController*** on public datasets focusing on video generation and downstream tasks, demonstrating its state-of-the-art performance.
Poster
Jingting Li · Yu Qian · Lin Zhao · Su-Jing Wang

[ Exhibit Hall I ]

Abstract
Micro-expressions (MEs) are brief, low-intensity, often localized facial expressions. They could reveal genuine emotions individuals may attempt to conceal, valuable in contexts like criminal interrogation and psychological counseling. However, ME recognition (MER) faces challenges, such as small sample sizes and subtle features, which hinder efficient modeling. Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose a FED-PsyAU research framework. We begin with a psychological study on the coordination of upper and lower facial action units (AUs) to provide structured prior knowledge of facial muscle dynamics. We then develop a DPK-GAT network that combines these psychological priors with statistical AU patterns, enabling hierarchical learning of facial motion features from regional to global levels, effectively enhancing MER performance. Additionally, our federated learning framework advances MER capabilities across multiple clients without data sharing, preserving privacy and alleviating the limited-sample issue for each client. Extensive experiments on commonly-used ME databases demonstrate the effectiveness of our approach.
Poster
Yuechen Zhang · YaoYang Liu · Bin Xia · Bohao PENG · Zexin Yan · Eric Lo · Jiaya Jia

[ Exhibit Hall I ]

Abstract
We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that MagicMirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available.
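The Conditioned Adaptive Normalization component can be pictured with the following hedged sketch: a LayerNorm whose scale and shift are regressed from a face identity embedding, initialized so it starts as a plain LayerNorm. Dimensions and initialization are illustrative assumptions, not MagicMirror's exact adapter.

```python
# Hedged sketch of a conditioned adaptive normalization layer.
import torch
import torch.nn as nn

class ConditionedAdaNorm(nn.Module):
    def __init__(self, dim=768, id_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(id_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)   # start as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, tokens, id_emb):
        # tokens: (B, N, dim) video tokens; id_emb: (B, id_dim) identity embedding
        scale, shift = self.to_scale_shift(id_emb).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale[:, None]) + shift[:, None]

layer = ConditionedAdaNorm()
out = layer(torch.randn(2, 256, 768), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 256, 768])
```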
Poster
Yikun Ma · Yiqing Li · Jiawei Wu · Xing Luo · Zhi Jin

[ Exhibit Hall I ]

Abstract
Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and to ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality, multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.
Poster
Yue Su · Xinyu Zhan · Hongjie Fang · Han Xue · Hao-Shu Fang · Yong-Lu Li · Cewu Lu · Lixin Yang

[ Exhibit Hall I ]

Abstract
Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication.
Poster
Clinton A Mo · Kun Hu · Chengjiang Long · Dong Yuan · Wan-Chi Siu · Zhiyong Wang

[ Exhibit Hall I ]

Abstract
The motion skeleton is a core data structure of 3D animation workflows, producing character motions by posing a pre-defined bone hierarchy. Motion data is largely incompatible across skeletons with proportional and/or hierarchical differences, raising long-standing challenges for data-driven motion synthesis. To address this, Temporal Point Clouds (TPC) have emerged as a universal, cross-compatible motion representation, using temporally consistent points that map motion trajectories. While TPCs have demonstrated reversibility with skeletal motions, their role is currently limited to enabling cross-compatibility, whereas we believe motion tasks can be learned directly in the TPC medium. This would require TPC motion synthesis capabilities, which is an unexplored field due to its unique temporal consistency and point identity requirements. In this paper, we propose PUMPS, the primordial auto-encoder architecture for TPC data. It reduces point cloud frames independently into sampleable feature vectors, from which a decoder efficiently extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process without requiring expensive point-wise attention mechanisms in the architecture. Using the auto-encoder, we produce a pre-trained motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation tasks. PUMPS performs remarkably well even …
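The linear assignment-based point pairing can be illustrated directly: predicted and ground-truth point-cloud frames are matched by minimum-cost assignment (Hungarian algorithm via SciPy) before computing a reconstruction loss. The point count and the L2 cost are illustrative choices, not necessarily those used in PUMPS.

```python
# Hedged sketch of linear-assignment point pairing for point-cloud reconstruction.
import torch
from scipy.optimize import linear_sum_assignment

def paired_reconstruction_loss(pred, target):
    # pred, target: (N, 3) point-cloud frames with the same number of points
    cost = torch.cdist(pred, target)                      # (N, N) pairwise L2 costs
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    return ((pred[row] - target[col]) ** 2).sum(dim=-1).mean()

pred = torch.randn(128, 3, requires_grad=True)
target = torch.randn(128, 3)
loss = paired_reconstruction_loss(pred, target)
loss.backward()                                           # gradients flow through pred
print(loss.item())
```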
Poster
Baoyou Chen · Ce Liu · Weihao Yuan · Zilong Dong · Siyu Zhu

[ Exhibit Hall I ]

Abstract
Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts.
Poster
Juan Hu · Shaojing Fan · Terence Sim

[ Exhibit Hall I ]

Abstract
Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios, due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce \textsf{HICOM}, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that \textsf{HICOM} improves average accuracy by 3.3\% in in-dataset detection and 2.8\% under real-world perturbations. Moreover, it outperforms existing methods by 5.8\% on unseen datasets, demonstrating the generalization of human-inspired cues. \textsf{HICOM} further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on involving human factors to enhance defense against deepfakes.
Poster
Constantin Patsch · Yuankai Wu · Marsil Zakour · Driton Salihu · Eckehard Steinbach

[ Exhibit Hall I ]

Abstract
Online mistake detection is crucial across various domains, ranging from industrial automation to educational applications, as mistakes can be corrected by the human operator after their detection due to the continuous inference on a video stream. While prior research mainly addresses procedural errors that often relate to temporal and ordering information, identifying a broader range of error types is essential for real-world implementation. In this work, we present MistSense, an approach for online mistake identification that includes this versatility by considering both procedural errors, which involve incorrect action sequences, and execution errors, such as motor inaccuracies or improper equipment use. Our method integrates RGB and hand pose features to capture fine-grained contextual cues in order to detect a mistake. By jointly modeling spatial and sequential aspects of human actions, our framework enables robust and adaptive error detection in dynamic environments. Once a mistake has been detected, we leverage a large language model (LLM) which provides an error explanation that gives the user further insights into why an action has been identified as a mistake. The evaluation on common mistake detection benchmarks shows the effectiveness of our approach.
Poster
Arthur Josi · Luiz Gustavo Hafemann · Abdallah Dib · Emeline Got · Rafael M. O. Cruz · Marc-André Carbonneau

[ Exhibit Hall I ]

Abstract
Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. We start by learning an expression representation from high quality 3D data of unpaired facial expressions. Then, we train a model to predict expression from monocular images relying on a novel semi-supervised scheme using low quality synthetic data. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to new identities.
Poster
Mengyu Yang · Yiming Chen · Haozheng Pei · Siddhant Agarwal · Arun Vasudevan · James Hays

[ Exhibit Hall I ]

Abstract
Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly responsible. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. Our model enforces object-awareness by using a slot attention visual encoder. We then develop an automatic method to compute segmentation masks of the objects involved to guide the model's focus towards the most informative regions of the interaction. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.
Poster
Xunpeng Yi · yibing zhang · Xinyu Xiang · Qinglong Yan · Han Xu · Jiayi Ma

[ Exhibit Hall I ]

Abstract
Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability to real-time fusion devices. In this paper, we propose a novel approach, termed LUT-Fuse, that targets extremely fast fusion by distilling into learnable lookup tables specifically designed for image fusion. Firstly, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional LUT quantization methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code will be made publicly available.
Poster
Yisu Zhang · Chenjie Cao · Chaohui Yu · Jianke Zhu

[ Exhibit Hall I ]

Abstract
Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data.
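The three principles above (linear scalability, orthogonality, norm consistency) can be illustrated with a small, hypothetical PyTorch sketch of fusing two LoRA weight updates. The function and variable names are placeholders, not the LiON-LoRA implementation; it is only a hedged illustration of the idea.

```python
# Hedged sketch: fuse two low-rank updates with an orthogonality check,
# norm matching, and linear strength scalars. Not the authors' code.
import torch

def fuse_lora_deltas(delta_a, delta_b, alpha_a=1.0, alpha_b=1.0):
    """Combine two low-rank weight updates for one linear layer."""
    # Rough orthogonality check between the two flattened updates.
    cos = torch.nn.functional.cosine_similarity(
        delta_a.flatten(), delta_b.flatten(), dim=0)
    # Norm consistency: rescale the second update to the first update's norm
    # so that neither branch dominates the fused weights.
    delta_b = delta_b * (delta_a.norm() / (delta_b.norm() + 1e-8))
    # Linear scaling: alpha_a and alpha_b act as strength controls.
    fused = alpha_a * delta_a + alpha_b * delta_b
    return fused, cos.item()

# Toy usage with two random rank-4 updates for a 64x64 weight.
A1, B1 = torch.randn(64, 4), torch.randn(4, 64)
A2, B2 = torch.randn(64, 4), torch.randn(4, 64)
fused, cos = fuse_lora_deltas(A1 @ B1, A2 @ B2, alpha_a=0.8, alpha_b=1.2)
print(fused.shape, f"cosine similarity: {cos:.3f}")
```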
Poster
Zhuo Li · Mingshuang Luo · RuiBing Hou · XIN ZHAO · Hao Liu · Hong Chang · Zimo Liu · Chen Li

[ Exhibit Hall I ]

Abstract
Human motion generation has been widely studied due to its crucial role in areas such as digital humans and humanoid robot control. However, many current motion generation approaches disregard physics constraints, frequently resulting in physically implausible motions with pronounced artifacts such as floating and foot sliding. Meanwhile, training an effective motion physics optimizer with noisy motion data remains largely unexplored. In this paper, we propose Morph, a Motion-Free physics optimization framework, consisting of a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on expensive real-world motion data. Specifically, the motion generator is responsible for providing large-scale synthetic, noisy motion data, while the motion physics refinement module utilizes these synthetic data to learn a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. Additionally, we introduce a prior reward module to enhance the stability of the physics optimization process and generate smoother and more stable motions. These physically refined motions are then used to fine-tune the motion generator, further enhancing its capability. This collaborative training paradigm enables mutual enhancement between the motion generator and the motion physics refinement module, significantly improving practicality and robustness in real-world applications. Experiments …
Poster
Kaidong Zhang · Rongtao Xu · Ren Pengzhen · Junfan Lin · Hefeng Wu · Liang Lin · Xiaodan Liang

[ Exhibit Hall I ]

Abstract
Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
Poster
Jungbin Cho · Junwan Kim · Jisoo Kim · Minseo Kim · Mingu Kang · Sungeun Hong · Tae-Hyun Oh · Youngjae Yu

[ Exhibit Hall I ]

Abstract
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this discord between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals on diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Code and checkpoints will be released.
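As a rough, hedged illustration of decoding discrete tokens with a rectified-flow-style velocity field (framing token decoding as conditional generation), the sketch below uses a toy network and Euler integration. All module names, dimensions, and the sampling schedule are illustrative assumptions rather than the released DisCoRD model.

```python
# Toy token-conditioned velocity field and Euler-step decoding from noise
# (t=0) to continuous motion (t=1). Illustrative only.
import torch
import torch.nn as nn

class TokenConditionedVelocity(nn.Module):
    def __init__(self, motion_dim=64, codebook_size=512, embed_dim=128):
        super().__init__()
        self.token_embed = nn.Embedding(codebook_size, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(motion_dim + embed_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim))

    def forward(self, x, t, tokens):
        cond = self.token_embed(tokens)               # (B, T, embed_dim)
        t_feat = t.expand(x.shape[0], x.shape[1], 1)  # broadcast the time scalar
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

@torch.no_grad()
def decode(model, tokens, motion_dim=64, steps=16):
    B, T = tokens.shape
    x = torch.randn(B, T, motion_dim)                 # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1, 1), i * dt)
        x = x + dt * model(x, t, tokens)              # Euler integration step
    return x

model = TokenConditionedVelocity()
tokens = torch.randint(0, 512, (2, 30))               # 2 sequences of 30 tokens
print(decode(model, tokens).shape)                    # torch.Size([2, 30, 64])
```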
Poster
Syed Talal Wasim · Hamid Suleman · Olga Zatsarynna · Muzammal Naseer · Juergen Gall

[ Exhibit Hall I ]

Abstract
We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. While recent State Space Models (SSMs) like Mamba have shown promise through input-dependent selectivity on three key parameters, the critical forget-gate ($\textbf{A}$ matrix) controlling temporal memory remains static. We address this limitation by introducing a mixture of experts approach that dynamically selects contextually relevant $\textbf{A}$ matrices based on input features, enhancing representational capacity without sacrificing computational efficiency. Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets demonstrate that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Our results highlight the importance of input-dependent forget-gate mechanisms for reliable prediction of human behavior in diverse real-world scenarios.
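A minimal sketch of the core idea, assuming a diagonal state-space parameterization: a gating network produces input-dependent mixture weights over K candidate forget-gate (A) parameters. The shapes and the gating design are assumptions made for illustration, not the MixANT code.

```python
# Input-conditioned mixture over K candidate A parameters of a diagonal
# state-space layer; the blended A becomes dynamic per timestep.
import torch
import torch.nn as nn

class MixtureOfAExperts(nn.Module):
    def __init__(self, d_model=64, d_state=16, num_experts=4):
        super().__init__()
        # K expert log-A parameters, each over (d_model, d_state).
        self.log_A = nn.Parameter(torch.randn(num_experts, d_model, d_state) * 0.1)
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: (B, L, d_model); gate from per-step features.
        w = torch.softmax(self.gate(x), dim=-1)               # (B, L, K)
        # Blend expert parameters per timestep (negative exp keeps A stable).
        A = torch.einsum('blk,kds->blds', w, -torch.exp(self.log_A))
        return A                                              # (B, L, d_model, d_state)

layer = MixtureOfAExperts()
print(layer(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64, 16])
```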
Poster
Kim Sung-Bin · Jeongsoo Choi · Puyuan Peng · Joon Chung Chung · Tae-Hyun Oh · David Harwath

[ Exhibit Hall I ]

Abstract
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
Poster
Thomas Kreutz · Max Mühlhäuser · Alejandro Sanchez Guinea

[ Exhibit Hall I ]

Abstract
Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a \underline{\textbf{D}e}ep \underline{\textbf{S}}keleton-\underline{\textbf{P}}ointcloud-\underline{\textbf{I}}MU-\underline{\textbf{T}}ext \underline{\textbf{E}}mbedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton$\leftrightarrow$Pointcloud$\leftrightarrow$IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
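The joint embedding space described above is learned with noise contrastive estimation; a standard symmetric InfoNCE loss between two of the modalities looks roughly like the sketch below. The full model would apply such terms pairwise across the LiDAR, skeleton, IMU, and text branches; encoders and dimensions here are placeholders.

```python
# Symmetric InfoNCE between two embedding batches with matching pairs on
# the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(z_a.shape[0])          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# e.g. point-cloud embeddings vs. skeleton embeddings for a batch of 32 clips
loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```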
Poster
Xianghan Meng · Zhengyu Tong · Zhiyuan Huang · Chun-Guang Li

[ Exhibit Hall I ]

Abstract
Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, frames in videos capturing complex human motions against cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ are temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.
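For readers unfamiliar with rate reduction, the sketch below computes the generic coding-rate quantities that such objectives build on: the total rate of all features minus the sum of per-segment rates. This is the standard formulation from the rate-reduction literature, not necessarily the exact $\text{TR}^2\text{C}$ loss.

```python
# Coding rate of a feature matrix and its rate-reduction gap across segments.
import torch

def coding_rate(Z, eps=0.5):
    """Z: (d, n) feature matrix, columns are samples."""
    d, n = Z.shape
    I = torch.eye(d)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z @ Z.t())

def rate_reduction(Z, labels, eps=0.5):
    d, n = Z.shape
    total = coding_rate(Z, eps)
    compressed = 0.0
    for c in labels.unique():
        Zc = Z[:, labels == c]
        compressed = compressed + (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return total - compressed            # larger gap = more structured segments

Z = torch.randn(32, 200)                 # 32-dim features for 200 frames
labels = torch.randint(0, 4, (200,))     # 4 tentative motion segments
print(rate_reduction(Z, labels).item())
```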
Poster
Shengdong Han · Shangdong Yang · Yuxuan Li · Xin Zhang · Xiang Li · jian Yang · Ming-Ming Cheng · Yimian Dai

[ Exhibit Hall I ]

Abstract
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models, which will be made publicly available soon.
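DISTA-Net builds on iterative shrinkage-thresholding. As background, a plain (non-learned) ISTA loop with a fixed soft threshold is sketched below; the measurement operator, sizes, and threshold are chosen purely for illustration, whereas DISTA-Net would generate the weights and thresholds dynamically.

```python
# Classic ISTA: proximal gradient descent for an L1-regularized least-squares
# problem, here with a dense random operator standing in for the imaging model.
import torch

def soft_threshold(x, theta):
    return torch.sign(x) * torch.clamp(torch.abs(x) - theta, min=0.0)

def ista(y, A, num_iters=50, theta=0.01):
    """Solve min_x 0.5*||Ax - y||^2 + theta*||x||_1."""
    L = torch.linalg.matrix_norm(A, ord=2) ** 2     # Lipschitz constant of A^T A
    x = torch.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.t() @ (A @ x - y)
        x = soft_threshold(x - grad / L, theta / L)
    return x

A = torch.randn(64, 256)                            # stand-in measurement operator
x_true = torch.zeros(256)
x_true[[10, 100, 200]] = torch.tensor([1.0, -0.5, 2.0])
y = A @ x_true + 0.01 * torch.randn(64)
print(ista(y, A).abs().topk(3).indices)             # indices of largest recovered entries
```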
Poster
Pin-Hung Kuo · Jinshan Pan · Shao-Yi Chien · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
The Transformer architecture has excelled in NLP and vision tasks, but its self-attention complexity grows quadratically with image size, making high-resolution tasks computationally expensive. We introduce our model, featuring Concerto Self-Attention (CSA), for image deblurring. CSA splits self-attention into global and local components while retaining partial information in additional dimensions, achieving linear complexity. A Cross-Dimensional Communication module enhances expressiveness by linearly combining attention maps. Additionally, our gated-dconv MLP merges the two-staged Transformer design into a single stage. Extensive evaluations show our method performs favorably against state-of-the-art works in deblurring, deraining, and JPEG artifact removal. Code and models will be publicly available.
Poster
Haejun Han · Hang Lu

[ Exhibit Hall I ]

Abstract
We propose ASCENT, a novel framework for tracking neurons in 3D fluorescence microscopy recordings without relying on manual track annotations. ASCENT leverages self-supervised contrastive learning to learn robust, discriminative embeddings from detected neuron candidates. At its core is a volume compression module that transforms full 3D volumetric data into an efficient 2D representation by iteratively projecting along the z-axis and integrating positional information. This compressed representation is processed by a deep encoder (e.g., ResNet or Vision Transformer) to yield robust feature vectors that capture both appearance and spatial relationships among neurons. Extensive experiments on both in-house and public datasets demonstrate that ASCENT achieves state-of-the-art tracking performance with fast inference speed while removing the need for costly manual labeling and heavy pre- and post-processing. Our results suggest that this approach provides a scalable solution for 3D neuron tracking and holds promise for applications such as inter-individual neuron identity matching and demixing overlapping cells.
Poster
Wenjie Zhuo · Fan Ma · Hehe Fan

[ Exhibit Hall I ]

Abstract
We introduce InfiniDreamer, a novel framework for arbitrarily long human motion generation. Existing motion generation methods are often constrained to short sequences due to the lack of long motion training data. To overcome this, InfiniDreamer first generates sub-motions corresponding to each textual description and assembles them into a coarse long sequence using randomly initialized transition segments. To refine the entire motion, we propose Segment Score Distillation (SSD)—an optimization-based method that leverages a motion prior trained solely on short clips, enabling long-sequence generation without additional training. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
Poster
Taekyung Ki · Dongchan Min · Gyeongsu Chae

[ Exhibit Hall I ]

Abstract
With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
Poster
Phu Tran Dinh · Hung Dao · Daeyoung Kim

[ Exhibit Hall I ]

Abstract
Video super-resolution (VSR) remains a significant challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency-domain gap between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
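A hedged sketch of what a "Frequency Charbonnier-like" loss could look like: a Charbonnier penalty applied to the FFT residual between reconstructed and ground-truth frames. The exact transform, normalization, and weighting used by VSRM may differ.

```python
# Charbonnier-style penalty on the complex FFT residual between frames.
import torch

def frequency_charbonnier(pred, target, eps=1e-3):
    # pred, target: (B, C, H, W) frames.
    Fp = torch.fft.rfft2(pred, norm='ortho')
    Ft = torch.fft.rfft2(target, norm='ortho')
    diff = Fp - Ft                                         # complex-valued residual
    return torch.sqrt(diff.real ** 2 + diff.imag ** 2 + eps ** 2).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(frequency_charbonnier(pred, target).item())
```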
Poster
Zhidan Xu · Xiaoqin Zhang · Shijian Lu

[ Exhibit Hall I ]

Abstract
Face retouching has achieved impressive performance, largely driven by its wide range of applications in various real-world tasks. However, most existing works encounter a dilemma between global consistency and local detail preservation, partially due to the lack of large-scale, high-quality training data. We address the face retouching challenge from two perspectives. First, we create a large-scale face retouching benchmark to mitigate the data scarcity issue. The benchmark comprises 25,000 pairs of high-quality facial images (before and after face retouching) that contain a variety of facial attributes and blemish types such as acne and moles. Second, we design a novel framework that introduces frequency selection and restoration (FSR) and multi-resolution fusion (MRF), leveraging frequency-aware dynamic aggregation and spatial-frequency filtering to achieve global consistency and local detail preservation concurrently. Inspired by the principle of JPEG compression, FSR introduces frequency-domain quantization with spatial projections to learn enhanced feature representations. MRF fuses multi-resolution features via Laplacian pyramid fusion, removing large-area blemishes and preserving local fine details effectively. Extensive experiments over multiple benchmarks show that the proposed framework outperforms the state of the art quantitatively and qualitatively. The created benchmark also provides valuable data for training and evaluating both existing and future face retouching networks.
Poster
Yufei Zhu · Hao Chen · Yongjian Deng · Wei You

[ Exhibit Hall I ]

Abstract
Traditional motion deblurring methods struggle to effectively model motion information within the exposure time. Recently, event cameras have attracted significant research interest for their ability to model motion cues over the exposure duration. However, existing methods directly fuse event features with image features, overlooking the intrinsic heterogeneity of events. In this paper, we identify that the event modality contains two conflicting types of information: edge features and motion cues. Events accumulated over a short exposure period capture sharp edge details but lose motion information, while those accumulated over a long exposure period blur edge details due to motion. To address this issue, we propose a simple yet effective approach to disentangle these two cues from event features and employ an edge-aware sharpening module along with a motion-driven scale-adaptive deblurring module to fully leverage both. Specifically, the first module aids in restoring sharp edges by leveraging the clear edge features provided by events, while the second module leverages motion cues to learn diverse blur kernels, adaptively adjusting the receptive field for optimal deblurring. Extensive experiments on synthetic and real-world datasets validate the effectiveness of our approach, which yields a substantial improvement over state-of-the-art single-frame methods and surpasses most multi-frame-based methods. Code will be …
Poster
Marvin Heidinger · Snehal Jauhri · Vignesh Prasad · Georgia Chalvatzaki

[ Exhibit Hall I ]

Abstract
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.
Poster
Fabio De Sousa Ribeiro · Omar Todd · Charles Jones · Avinash Kori · Raghav Mehta · Ben Glocker

[ Exhibit Hall I ]

Abstract
We propose the Flow Stochastic Segmentation Network (Flow-SSN), a generative model for probabilistic segmentation featuring discrete-time autoregressive and modern continuous-time flow parameterisations. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, as most of the model capacity is allocated to learning the base distribution of the flow, which constitutes an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results.
Poster
Subrat Kishore Dutta · Xiao Zhang

[ Exhibit Hall I ]

Abstract
Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, causing them to be easily noticeable by human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of patch location to both victim model prediction and human visual system, then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy for optimizing invisible perturbations. Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.
Poster
Yi Wang · Zhitong Xiong · Chenying Liu · Adam Stewart · Thomas Dujardin · Nikolaos Ioannis Bountos · Angelos Zavras · Franziska Gerken · Ioannis Papoutsis · Laura Leal-Taixé · Xiao Xiang Zhu

[ Exhibit Hall I ]

Abstract
Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.
Poster
Weixi Zheng · Jingwang Ling · Zhibo Wang · Quan Wang · Feng Xu

[ Exhibit Hall I ]

Abstract
We present the first method for personalized dental shape reconstruction and teeth-inclusive facial performance capture using only a single phone camera. Our approach democratizes high-quality facial avatars through a non-invasive, low-cost setup by addressing the ill-posed monocular capture problem with an analysis-by-synthesis approach. We introduce a representation adaptation technique that maintains both mesh and SDF representations of teeth, enabling efficient differentiable rendering while preventing teeth-lip interpenetration. To overcome alignment challenges with similar-appearing dental components, we leverage foundation models for semantic teeth segmentation and design specialized optimization objectives. Our method addresses the challenging occlusions of teeth during facial performance through optimization strategies that leverage facial structural priors, while our semantic mask rendering loss with optimal transport-based matching ensures convergence despite significant variations in initial positioning. Code will be released.

Demonstration: Demos 3 Wed 22 Oct 10:45 a.m.  

  • Layered Diffusion Brushes, Peyman Gholami, Robert Xiao
  • Semantic navigation, Yue Hu
  • Demo for One-Step Specular Highlight Removal with Adapted Diffusion Models, Mahir Atmis, Levent Karacan, Mehmet Sarıgul
  • Image as an IMU, Jerred Chen, Ronald Clark

Session: Doctoral Consortium Wed 22 Oct 11:00 a.m.  


Oral 4B: 3D Pose Understanding Wed 22 Oct 01:00 p.m.  

Oral
Lennart Bastian · Mohammad Rashed · Nassir Navab · Tolga Birdal

[ Kalakaua Ballroom ]

Abstract
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings.
Oral
Carl Olsson · Yaroslava Lochman · Johan Malmport · Christopher Zach

[ Kalakaua Ballroom ]

Abstract
Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.
Oral
Jinghao Wang · Zhang Li · Zi Wang · Banglei Guan · Yang Shang · Qifeng Yu

[ Kalakaua Ballroom ]

Abstract
Recently, 6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases; 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling, providing compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
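The inductive conformal calibration step can be sketched as follows: nonconformity scores on a held-out calibration set are turned into a finite-sample-corrected quantile that defines keypoint confidence regions at a user-chosen coverage level. The score choice and shapes below are illustrative assumptions, not the paper's exact pipeline.

```python
# Inductive conformal prediction: calibrate a score threshold, then use it
# as the boundary of per-keypoint confidence regions at test time.
import numpy as np

def conformal_quantile(scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

# Calibration scores, e.g. Mahalanobis-style keypoint errors under the
# regressed Gaussian distributions (here simulated for the sketch).
rng = np.random.default_rng(0)
calib_scores = rng.chisquare(df=2, size=500)
q = conformal_quantile(calib_scores, alpha=0.1)

# At test time, each keypoint region is {u : score(u) <= q} (an ellipse for
# a Gaussian score); the threshold then propagates to a 6D pose region.
print(f"90% confidence region threshold: {q:.2f}")
```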
Oral
Yaqing Ding · Viktor Kocur · VACLAV VAVRA · Zuzana Berger Haladova · jian Yang · Torsten Sattler · Zuzana Kukelova

[ Kalakaua Ballroom ]

Abstract
Recent advances in monocular depth estimation methods (MDE) and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) calibrated cameras, (2) cameras with an unknown shared focal length, and (3) cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code will be made publicly available.
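As a small illustration of just the scale/shift sub-problem that these solvers handle jointly with pose: given monocular depths and reference depths at corresponding points, an affine alignment can be recovered by linear least squares. The actual solvers estimate these unknowns together with the relative pose from point correspondences; this sketch isolates only the affine-depth part.

```python
# Recover scale a and shift b so that a * d_mono + b matches reference depths.
import numpy as np

def fit_scale_shift(d_mono, d_ref):
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)      # (N, 2) design matrix
    (a, b), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return a, b

rng = np.random.default_rng(1)
d_ref = rng.uniform(2.0, 10.0, size=200)                       # "true" depths
d_mono = (d_ref - 0.7) / 3.0 + 0.01 * rng.normal(size=200)     # scaled/shifted estimate
a, b = fit_scale_shift(d_mono, d_ref)
print(f"recovered scale {a:.2f}, shift {b:.2f}")                # roughly 3.0 and 0.7
```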
Oral
Chengtang Yao · Lidong Yu · Zhidan Liu · Jiaxi Zeng · Yuwei Wu · Yunde Jia

[ Kalakaua Ballroom ]

Abstract
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the …
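A hedged sketch of a binary local ordering map: each pixel's monocular depth is compared with its neighbours and only the sign of the difference is kept, giving a representation that is invariant to unknown affine depth scaling. The window size and border handling below are illustrative choices, not the paper's exact design.

```python
# Binary local ordering map from a (relative) depth map via local comparisons.
import torch
import torch.nn.functional as F

def local_ordering_map(depth, kernel_size=3):
    # depth: (B, 1, H, W). Unfold local windows and compare to the center pixel.
    pad = kernel_size // 2
    patches = F.unfold(depth, kernel_size, padding=pad)   # (B, k*k, H*W)
    center = depth.flatten(2)                              # (B, 1, H*W)
    ordering = (patches > center).float()                  # 1 if neighbour is farther
    B, _, H, W = depth.shape
    return ordering.view(B, kernel_size ** 2, H, W)

d = torch.rand(1, 1, 32, 32)
print(local_ordering_map(d).shape)    # torch.Size([1, 9, 32, 32])
```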

Oral 4A: Vision + graphics Wed 22 Oct 01:00 p.m.  

Oral
Zichen Liu · Yihao Meng · Hao Ouyang · Yue Yu · Bolin Zhao · Daniel Cohen-Or · Huamin Qu

[ Exhibit Hall III ]

Abstract
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The animation is represented by a canonical field that aggregates the semantic content in a canonical shape and a deformation field that applies per-frame motion to deform the canonical shape. Two fields are jointly optimized by the priors from a large pretrained text-to-video diffusion model using score-distillation loss with designed regularization, encouraging the video coherence with the intended textual concept while maintaining legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our methodology over baselines. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability.
Oral
Ava Pun · Kangle Deng · Ruixuan Liu · Deva Ramanan · Changliu Liu · Jun-Yan Zhu

[ Exhibit Hall III ]

Abstract
We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during auto-regressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method, enabling us to generate colored and textured designs. We show that our designs can be assembled by humans manually as well as by robotic arms automatically. Upon publication, we will release our new dataset, StableText2Lego, which contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models.
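The validity-check-and-rollback mechanism during autoregressive inference can be sketched generically as below. `model` and `is_stable` are placeholders (the real system prunes infeasible bricks with physics and assembly constraints), so this is only a hedged outline of the control flow, not LegoGPT's code.

```python
# Autoregressive sampling with a feasibility check and token-level rollback:
# rejected tokens are masked out and the step is resampled.
import torch

def sample_with_rollback(model, is_stable, prompt_ids, max_new=64, max_retries=5):
    seq = list(prompt_ids)
    for _ in range(max_new):
        banned = set()
        for _ in range(max_retries):
            logits = model(torch.tensor([seq])).squeeze(0)[-1]   # next-token logits
            if banned:
                logits[list(banned)] = float('-inf')             # prune rejected tokens
            token = torch.multinomial(torch.softmax(logits, -1), 1).item()
            if is_stable(seq + [token]):                         # physics/assembly check
                seq.append(token)
                break
            banned.add(token)                                    # roll back and resample
        else:
            break                                                # no feasible continuation
    return seq

# Toy usage with a dummy model and a check that accepts everything.
dummy_model = lambda ids: torch.randn(1, ids.shape[1], 100)
print(sample_with_rollback(dummy_model, lambda s: True, [1, 2, 3], max_new=5))
```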
Oral
Richard Liu · Daniel Fu · Noah Tan · Itai Lang · Rana Hanocka

[ Exhibit Hall III ]

Abstract
In this work we present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
Oral
Xianglong He · Zi-Xin Zou · Chia Hao Chen · Yuan-Chen Guo · Ding Liang · Chun Yuan · Wanli Ouyang · Yanpei Cao · Yangguang Li

[ Exhibit Hall III ]

Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation …
Oral
Jianhong Bai · Menghan Xia · Xiao Fu · Xintao Wang · Lianrui Mu · Jinwen Cao · Zuozhu Liu · Haoji Hu · Xiang Bai · Pengfei Wan · Di ZHANG

[ Exhibit Hall III ]

Abstract
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through an elegant yet powerful video conditioning mechanism—an aspect often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.

Poster Session 4 & Exhibit Hall with Coffee Break Wed 22 Oct 02:30 p.m.  

Poster
Wenhao Wang · Yi Yang

[ Exhibit Hall I ]

Abstract
Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce **TIP-I2V**, the first large-scale dataset of over $1.70$ million unique user-provided **T**ext and **I**mage **P**rompts specifically for **I**mage-to-**V**ideo generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset.The dataset is anonymously available at https://huggingface.co/datasets/tipi2v/TIP-I2V.
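The abstract links the dataset on Hugging Face; a minimal loading sketch with the `datasets` library is shown below. The split and column names are assumptions and should be checked against the dataset card.

```python
# Hedged sketch: inspect the TIP-I2V prompts with the Hugging Face datasets
# library; only the repository name is taken from the URL in the abstract.
from datasets import load_dataset

ds = load_dataset("tipi2v/TIP-I2V", split="train")    # split name is an assumption
print(ds)                                              # shows available columns
example = ds[0]
print({k: type(v) for k, v in example.items()})        # e.g. text prompt, image, metadata
```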
Poster
Anand Kumar · Jiteng Mu · Nuno Vasconcelos

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns among artists regarding data privacy and copyright infringement. Gradually, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as Introspective Style attribution (IntroStyle) and is shown to perform superior to state-of-the-art models for style retrieval. We also introduce a synthetic Artistic Style Split (ArtSplit) dataset to isolate artistic style and evaluate fine-grained style attribution performance.
Poster
Pingchuan Ma · Xiaopei Yang · Ming Gui · Yusong Li · Felix Krause · Johannes Schusterbauer · Björn Ommer

[ Exhibit Hall I ]

Abstract
The human perception of style and content is inherently subjective and varies widely. Likewise, computer vision models learn diverse latent representations of these attributes. While generative models focus on stylization and content transfer, discriminative approaches aim to capture effective representations of style and content. However, explicitly defining these attributes remains inherently difficult. To address this, we propose a method that implicitly discovers style and content representations within a semantic-rich compact space, avoiding spatial token constraints. Leveraging flow matching, our framework effectively separates style and content without predefined definitions, offering a structured yet flexible representation that can be directly applied to any precomputed CLIP embeddings. To further facilitate this, we have curated a dataset of $510{,}000$ samples ($51$ styles $\times$ $10{,}000$ content samples) for training and evaluating our model. While our method provides a strong foundation for representation learning, it is also adaptable for controllable generation tasks. We demonstrated our implicitly learned style and content representations can generalize well to ImageNet-1k and WikiArt in a zero-shot fashion. We showcase promising visual results involving various styles and contents. \textit{We will release the code and the curated dataset.}
Poster
Jinpei Guo · Zheng Chen · Wenbo Li · Yong Guo · YULUN ZHANG

[ Exhibit Hall I ]

Abstract
Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifact, especially in highly compressed images. To address these challenges, we propose CODiff, a \textbf{c}ompression-aware \textbf{o}ne-step \textbf{diff}usion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model's generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code and models will be released.
Poster
Zhenxiong Tan · Songhua Liu · Xingyi Yang · Qiaochu Xue · Xinchao Wang

[ Exhibit Hall I ]

Abstract
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1\% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
Poster
Lijie Liu · Tianxiang Ma · Bingchuan Li · Zhuowei Chen · Jiawei Liu · Gen Li · SiYu Zhou · Qian HE · Xinglong Wu

[ Exhibit Hall I ]

Abstract
The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent videos following textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves perfect subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms other state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages.
Poster
Haoyang Xu · Tianhao Zhao · Sibei Yang · Yutian Lin

[ Exhibit Hall I ]

Abstract
Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation of these models is the incomplete display of objects, where fragments or missing parts can undermine the model's performance in downstream applications such as dataset synthesis and video generation using 2D prior-based models. In this study, we conduct an in-depth analysis of this issue and reveal that the primary culprit behind incomplete object generation is $\textit{RandomCrop}$. This data augmentation method, widely used in diffusion models, though it enhances model generalization, disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values occurring at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
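A hedged sketch of the training-free idea described above: penalize activation magnitudes in a thin band along the image borders during the early denoising steps, so that gradients push salient content away from the frame edges. The hook point and weighting are illustrative assumptions, not the authors' exact implementation.

```python
# Boundary-band penalty on a spatial activation/latent map.
import torch

def boundary_penalty(activation, border=4):
    """activation: (B, C, H, W) spatial map from an early denoising step."""
    B, C, H, W = activation.shape
    mask = torch.zeros(H, W)
    mask[:border, :] = 1.0
    mask[-border:, :] = 1.0
    mask[:, :border] = 1.0
    mask[:, -border:] = 1.0
    # Mean absolute activation inside the border band; minimizing this
    # discourages object content at the image boundary.
    return (activation.abs() * mask).sum() / (mask.sum() * B * C)

act = torch.randn(1, 4, 64, 64, requires_grad=True)
loss = boundary_penalty(act)
loss.backward()                  # gradient w.r.t. the latent could steer denoising
print(loss.item())
```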
Poster
Alejandro Pardo · Fabio Pizzati · Tong Zhang · Alexander Pondaven · Philip Torr · Juan Perez · Bernard Ghanem

[ Exhibit Hall I ]

Abstract
Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting impactful match-cuts is a challenging and resource-intensive process that requires deliberate artistic planning throughout the production pipeline. In this work, we introduce MatchDiffusion, a training-free method that uses text-to-video diffusion models to automatically generate match-cuts. As such, MatchDiffusion is the first method for match-cut generation. Our method leverages an inherent property of diffusion models, whereby the early denoising steps determine the broad appearance of the scene, while the latter steps add details. Motivated by this property, MatchDiffusion first performs "Joint Diffusion", by initializing generation for two prompts from a shared noise sample, and following a shared denoising path for the first denoising steps. This process results in the two videos sharing structural and motion characteristics. After Joint Diffusion, we then conduct "Disjoint Diffusion", allowing the videos' denoising paths to diverge and introduce their unique details. MatchDiffusion thus yields visually coherent videos that are amenable to match-cuts. We demonstrate the effectiveness of our method through user studies and metrics, showing its potential to democratize match-cut creation.
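The joint-then-disjoint schedule can be sketched as below: both prompts start from a shared noise sample and follow a shared trajectory for the first steps, then diverge. Averaging the two predictions during the joint phase is an assumption about the sharing mechanism, used only to make the sketch concrete; the denoiser here is a placeholder.

```python
# Joint Diffusion (shared path) followed by Disjoint Diffusion (diverging paths).
import torch

@torch.no_grad()
def match_cut_sample(denoise, latents_shape, prompts, timesteps, joint_steps=10):
    """denoise(x, t, prompt) -> updated latent at step t (placeholder callable)."""
    x = torch.randn(latents_shape)                  # shared noise sample
    xa, xb = x.clone(), x.clone()
    for i, t in enumerate(timesteps):
        ya = denoise(xa, t, prompts[0])
        yb = denoise(xb, t, prompts[1])
        if i < joint_steps:                         # Joint Diffusion: shared path
            shared = 0.5 * (ya + yb)
            xa, xb = shared.clone(), shared.clone()
        else:                                       # Disjoint Diffusion: diverge
            xa, xb = ya, yb
    return xa, xb

# Toy usage with a dummy denoiser that just shrinks the latent each step.
dummy = lambda x, t, p: 0.9 * x
va, vb = match_cut_sample(dummy, (1, 4, 8, 64, 64), ("prompt A", "prompt B"), range(30))
print(va.shape, torch.allclose(va, vb))
```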
Poster
Zhengyao Lyu · Chenyang Si · Tianlin Pan · Zhaoxi Chen · Kwan-Yee K. Wong · Yu Qiao · Ziwei Liu

[ Exhibit Hall I ]

Abstract
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient \textbf{Dual-Expert Consistency Model (DCM)}, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models will be made publicly available.
Poster
Yijing Lin · Mengqi Huang · Shuhan Zhuang · Zhendong Mao

[ Exhibit Hall I ]

Abstract
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5\% improvement in subject similarity for customized generation and a 10\% enhancement in image quality for the canny-to-image task.
Poster
Jimin Dai · Jiexi Yan · Jian Yang · lei luo

[ Exhibit Hall I ]

Abstract
The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), which is a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) the noise optimization through reparameterization to form optimized couplings with real images which are then utilized for training, effectively mitigating errors caused by Reflow's limitations. Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
Poster
Yuekun Dai · Haitian Li · Shangchen Zhou · Chen Change Loy

[ Exhibit Hall I ]

Abstract
RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. Experimental results show that our approach outperforms existing pipelines. Our code and benchmark will be publicly available.
Poster
Jianwei Fei · Yunshu Dai · Peipeng Yu · Zhe Kong · Jiantao Zhou · Zhihua Xia

[ Exhibit Hall I ]

Abstract
The commercialization of generative artificial intelligence (GenAI) has led to a multi-level ecosystem involving model developers, service providers, and consumers. Thus, ensuring traceability is crucial, as service providers may violate intellectual property rights (IPR), and consumers may generate harmful content. However, existing methods are limited to single-level attribution scenarios and cannot simultaneously trace across multiple levels. To this end, we introduce a scalable dual fingerprinting method for text-to-image (T2I) models, to achieve traceability of both service providers and consumers. Specifically, we propose 2-headed Fingerprint-Informed Low-Rank Adaptation (FI-LoRA), where each head is controlled by a binary fingerprint and capable of introducing the fingerprints into generated images. In practice, one FI-LoRA head is used by the developer to assign a unique fingerprint to each service provider, while the other is made available to service providers for embedding consumer-specific fingerprints during image generation. Our method does not merely embed two fingerprints within the generated image but instead allows independent control over them at the developer level and the business level, enabling simultaneous traceability of businesses and consumers. Experiments show that our method applies to various image generation and editing tasks of multiple T2I models, and can achieve over 99.9\% extraction accuracy for both fingerprints. Our …
Poster
Junyi Wu · Zhiteng Li · Zheng Hui · YULUN ZHANG · Linghe Kong · Xiaokang Yang

[ Exhibit Hall I ]

Abstract
Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and caching mechanisms, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72× on Open-Sora with minimal loss in generation quality. Extensive evaluations across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. We will release all code and models to facilitate further research.
Poster
Shivani Mall · Joao F. Henriques

[ Exhibit Hall I ]

Abstract
Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current …
Poster
jiale chen · Wei Wang · Chongyang Shi · Li Dong · Xiping Hu

[ Exhibit Hall I ]

Abstract
Watermarking as a traceable authentication technology has been widely applied in image copyright protection. However, most existing watermarking methods embed watermarks by adding irremovable perturbations to the cover image, causing permanent distortion. To address this issue, we propose a novel watermarking approach termed \textbf{C}over-\textbf{R}ecoverable Water\textbf{Mark} (CRMark). CRMark can losslessly recover the cover image and watermark in lossless channels and enables robust watermark extraction in lossy channels. CRMark leverages an integer Invertible Watermarking Network (iIWN) to achieve a lossless invertible mapping between the cover-image-watermark pair and the stego image. During the training phase, CRMark employs an encoder-noise-layer-decoder architecture to enhance its robustness against distortions. In the inference phase, CRMark first maps the cover-image-watermark pair into an overflowed stego image and a latent variable. Subsequently, the overflowed pixels and the latent variable are losslessly compressed into an auxiliary bitstream, which is then embedded into the clipped stego image using reversible data hiding. During extraction, in lossy channels, the noised stego image can directly undergo inverse mapping via iIWN to extract the watermark. In lossless channels, the latent variable and overflowed stego image are first recovered using reversible data hiding, followed by watermark extraction through iIWN. Extensive experimental results demonstrate that CRMark can …
Poster
Yangyang Xu · Bangzhen Liu · Wenqi Shao · Yong Du · Shengfeng He · Tingting Zhu

[ Exhibit Hall I ]

Abstract
Decoding stimulus images from fMRI signals has advanced with pre-trained generative models. However, existing methods struggle with cross-subject mappings due to cognitive variability and subject-specific differences. This challenge arises from sequential errors, where unidirectional mappings generate partially inaccurate representations that, when fed into diffusion models, accumulate errors and degrade reconstruction fidelity. To address this, we propose the Bidirectional Autoencoder Intertwining framework for accurate mind representation prediction. Our approach unifies multiple subjects through a Subject Bias Modulation Module while leveraging bidirectional mapping to better capture data distributions for precise representation prediction. To further enhance fidelity when decoding representations into stimulus images, we introduce a Semantic Refinement Module to improve semantic representations and a Visual Coherence Module to mitigate the effects of inaccurate visual representations. Integrated with ControlNet and Stable Diffusion, our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations. Moreover, our framework exhibits strong adaptability to new subjects with minimal training samples.
Poster
Jiancheng Zhao · Yifan Zhan · Qingtian Zhu · Mingze Ma · Muyao Niu · Zunian Wan · Xiang Ji · Yinqiang Zheng

[ Exhibit Hall I ]

Abstract
Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
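A toy sketch of non-uniform temporal sampling in the spirit described above, using greedy interval splitting driven by a variation measure; the BST bookkeeping of the actual method is omitted, and variation() is an assumed callable scoring how much the video changes between two frame indices.

def nonuniform_keyframes(num_frames, budget, variation):
    # variation(a, b): assumed callable measuring temporal change between frame indices a and b
    keys = [0, num_frames - 1]                       # start from the two endpoints
    while len(keys) < budget:
        keys.sort()
        a, b = max(zip(keys[:-1], keys[1:]), key=lambda ab: variation(ab[0], ab[1]))
        if b - a <= 1:                               # no interval left to split
            break
        keys.append((a + b) // 2)                    # densify where the video changes most
    return sorted(keys)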
Poster
Yuhang Ma · Keqiang Sun · Xiaoshi Wu · Hongsheng Li

[ Exhibit Hall I ]

Abstract
Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3), which comprises: (1) HPDv3, the first full-spectrum human preference dataset integrating 1.7M text-image pairs and 1M annotated pairwise comparisons from state-of-the-art generative models and high-quality real-world images, and (2) a preference model leveraging VLM-based feature extraction and RankNet loss for fine-grained ranking. Furthermore, we propose Chain-of-Human-Preference (CoHP), a novel reasoning approach for iterative image refinement. CoHP improves image quality efficiently without requiring additional training data. By using HPSv3 as a reward model, CoHP ensures that the highest-quality image is selected at each iteration, progressively enhancing the output. Extensive experiments demonstrate that HPSv3 serves as a robust benchmark for full-spectrum image evaluation, and CoHP offers an efficient, human-aligned approach to enhancing image generation quality.
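A minimal sketch of reward-guided iterative refinement in the spirit of CoHP, assuming generate() and score() stand in for the T2I model and an HPSv3-style preference model; the candidate count and number of rounds are arbitrary choices, not the paper's settings.

def chain_of_preference(prompt, generate, score, rounds=3, candidates=4, seed_image=None):
    # generate(prompt, init=...) and score(prompt, image) are hypothetical stand-ins
    best = seed_image
    for _ in range(rounds):
        pool = [generate(prompt, init=best) for _ in range(candidates)]
        best = max(pool, key=lambda img: score(prompt, img))   # keep the preferred candidate
    return best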
Poster
Kwanseok Kim · Jaehoon Hahm · Sumin Kim · Jinhwan Sul · Byung-Hak Kim · Joonseok Lee

[ Exhibit Hall I ]

Abstract
Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to a frame score averaged over multiple raters, ignoring the inherent subjectivity of what constitutes a "good" summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide deeper insight with novel metrics derived from an analysis of the knapsack selection, an important final step of summary generation that has been overlooked in evaluation.
Poster
Xueqing Deng · Linjie Yang · Qihang Yu · Chenglin Yang · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) models have advanced rapidly with diffusion-based breakthroughs, yet their evaluation remains challenging. Human assessments are costly, and existing automated metrics lack accurate compositional understanding. To address these limitations, we introduce PSG-Bench, a novel benchmark featuring 5K text prompts designed to evaluate the capabilities of advanced T2I models. Additionally, we propose PSGEval, a scene graph-based evaluation metric that converts generated images into structured representations and applies graph matching techniques for accurate and scalable assessment. PSGEval is a detection-based evaluation metric that does not rely on QA generation. Our experimental results demonstrate that PSGEval aligns well with human evaluations, mitigating biases present in existing automated metrics. We further provide a detailed ranking and analysis of recent T2I models, offering a robust framework for future research in T2I evaluation.
Poster
Nisha Huang · Henglin Liu · Yizhou Lin · Kaer Huang · Chubin Chen · Jie Guo · Tong-Yee Lee · Xiu Li

[ Exhibit Hall I ]

Abstract
Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.
Poster
Yifei Zhang · Lei Chen

[ Exhibit Hall I ]

Abstract
Driven by large-scale model iterations, the inference speed and generalization ability of 3D model generation have improved significantly. However, the quality of existing methods still falls short of enabling direct use without post-processing. Common issues include insufficient texture clarity, loss of semantic information, lack of fine-grained detail, and the generation of redundant artifacts. Moreover, current approaches focus solely on producing static structures, where individual components remain non-movable, without considering functional applications in the generation process. To address these limitations, we draw inspiration from LEGO-like modular construction and decompose complex models into semantically functional components. We propose LEGO-Maker, a novel framework that reformulates the text-to-3D task into a three-stage process: target image generation, functional semantic decomposition, and multi-task 3D generation with structured fusion. Leveraging a reorganized high-quality 3D dataset, we train a Diffusion model and a semantic segmentation model tailored for 3D generation tasks. Additionally, we design a motion-driven mechanism to introduce action sequences for functionally interactive modules after model fusion. Experimental results demonstrate that, compared to existing methods, our approach significantly enhances semantic understanding, model detail quality, and text consistency while showcasing direct applicability across various scenarios.
Poster
Peng Cai · liqiang liqiang · Kaicheng Yang · guodong guodong · lijia lijia · zhounan zhounan · Xiang An · Ninghua Yang · Jiankang Deng

[ Exhibit Hall I ]

Abstract
Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce \textbf{For}eground-\textbf{Cen}tric \textbf{Net}work~(\textbf{ForCenNet}) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art results on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. Our training code and pre-trained models will be released to facilitate future research.
Poster
Shoubin Yu · Difan Liu · Ziqiao Ma · Yicong Hong · Yang Zhou · Hao Tan · Joyce Chai · Mohit Bansal

[ Exhibit Hall I ]

Abstract
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models …
Poster
Songlin Yang · Yushi LAN · Honghua Chen · Xingang Pan

[ Exhibit Hall I ]

Abstract
Textured 3D morphing creates smooth and plausible interpolation sequences between two 3D objects, focusing on transitions in both shape and texture. This is important for creative applications like visual effects in filmmaking. Previous methods rely on establishing point-to-point correspondences and determining smooth deformation trajectories, which inherently restrict them to shape-only morphing on untextured, topologically aligned datasets. This restriction leads to labor-intensive preprocessing and poor generalization. To overcome these challenges, we propose a method for 3D regenerative morphing using a 3D diffusion prior. Unlike previous methods that depend on explicit correspondences and deformations, our method eliminates the additional need for obtaining correspondence and uses the 3D diffusion prior to generate morphing. Specifically, we first introduce a 3D diffusion model and interpolate the source and target information at three levels: initial noise, model parameters, and condition features. We then explore an Attention Fusion strategy to generate smoother morphing sequences. To further improve the plausibility of semantic interpolation and the generated 3D surfaces, we propose two strategies: (a) Token Reordering, where we match approximate tokens based on semantic analysis to guide implicit correspondences in the denoising process of the diffusion model, and (b) Low-Frequency Enhancement, where we enhance low-frequency signals in the tokens …
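A small sketch of interpolating at the three stated levels (initial noise, model parameters, condition features), assuming spherical interpolation for the noise and linear interpolation elsewhere; the function and attribute names are illustrative, not the paper's code.

import copy
import torch

def slerp(a, b, t, eps=1e-7):
    # spherical interpolation between two flattened noise tensors
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def interpolate_levels(noise_src, noise_tgt, model_src, model_tgt, cond_src, cond_tgt, t):
    noise_t = slerp(noise_src.flatten(), noise_tgt.flatten(), t).view_as(noise_src)
    model_t = copy.deepcopy(model_src)
    with torch.no_grad():
        for p_t, p_s, p_g in zip(model_t.parameters(), model_src.parameters(),
                                 model_tgt.parameters()):
            p_t.copy_((1 - t) * p_s + t * p_g)       # interpolate model parameters
    cond_t = (1 - t) * cond_src + t * cond_tgt       # interpolate condition features
    return noise_t, model_t, cond_t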
Poster
Chao Zhou · Tianyi Wei · Nenghai Yu

[ Exhibit Hall I ]

Abstract
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose **Self-Adaptive Attention Scaling (SaaS)**, a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods.
Poster
Shengfang ZHAI · Jiajun Li · Yue Liu · Huanran Chen · Zhihua Tian · Wenjie Qu · Qingni Shen · Ruoxi Jia · Yinpeng Dong · Jiaheng Zhang

[ Exhibit Hall I ]

Abstract
In recent years, text-to-image (T2I) diffusion models have garnered significant attention for their ability to generate high-quality images reflecting text prompts. However, their growing popularity has also led to the emergence of backdoor threats, posing substantial risks. Currently, effective defense strategies against such threats are lacking due to the diversity of backdoor targets in T2I synthesis. In this paper, we propose NaviDet, the first general input-level backdoor detection framework for identifying backdoor inputs across various backdoor targets. Our approach is based on the new observation that trigger tokens tend to induce significant neuron activation variation in the early stage of the diffusion generation process, a phenomenon we term Early-step Activation Variation. Leveraging this insight, NaviDet detects malicious samples by analyzing neuron activation variations caused by input tokens. Through extensive experiments, we demonstrate the effectiveness and efficiency of our method against various T2I backdoor attacks, surpassing existing baselines with significantly lower computational overhead. Furthermore, we rigorously demonstrate that our method remains effective against potential adaptive attacks.
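A hypothetical sketch of scoring tokens by early-step activation variation via token ablation; activations() is an assumed hook that collects features over the first denoising steps, and the threshold is arbitrary, so this only illustrates the general idea rather than the paper's detector.

def early_step_variation(activations, prompt_tokens, early_steps=5):
    # activations(prompt, steps): assumed hook returning a feature tensor collected
    # over the first `steps` denoising steps
    base = activations(" ".join(prompt_tokens), early_steps)
    scores = []
    for i in range(len(prompt_tokens)):
        ablated = prompt_tokens[:i] + prompt_tokens[i + 1:]
        act = activations(" ".join(ablated), early_steps)
        scores.append((base - act).abs().mean().item())   # variation induced by token i
    return scores                                          # unusually large -> suspicious token

def is_suspicious(scores, threshold=0.5):
    return max(scores) > threshold                         # threshold is an assumption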
Poster
Yi Liu · Shengqian Li · Zuzeng Lin · Feng Wang · Si Liu

[ Exhibit Hall I ]

Abstract
The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation enabling iterative refinement across scales and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
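A minimal sketch of softmax-relaxed codebook selection: the hard argmin lookup is replaced by a temperature-controlled softmax over distances, returning a differentiable mixture of code vectors. The temperature value is an assumption.

import torch
import torch.nn.functional as F

def softmax_relaxed_quantize(z, codebook, tau=1.0):
    # z: (N, D) encoder outputs; codebook: (K, D) code vectors; tau: temperature (assumed)
    dists = torch.cdist(z, codebook)              # (N, K) Euclidean distances
    weights = F.softmax(-dists / tau, dim=-1)     # soft, differentiable assignment
    return weights @ codebook                     # (N, D) convex mixture of codes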
Poster
yinhan Zhang · Yue Ma · Bingyuan Wang · Qifeng Chen · Zeyu Wang

[ Exhibit Hall I ]

Abstract
We present MagicColor, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. Artists are required to repeat the process to color each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose a self-play training strategy to address the lack of training data. Then an instance guider is introduced to provide the color of each instance. To achieve accurate color matching, we present fine-grained color matching with an edge loss to enhance visual quality. Equipped with the proposed modules, MagicColor automatically transforms sketches into vividly colored animations in accurate consistency with the multi-reference characters. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. Our model could even automate the colorization process, such that users …
Poster
Shengqi Dang · Yi He · Long Ling · Ziqing Qian · Nanxuan Zhao · Nan Cao

[ Exhibit Hall I ]

Abstract
Recent research shows that emotions can enhance users' cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this paper, we introduce the task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, a general emotional image generation model that generates images based on free text prompts and Valence-Arousal (V-A) values. It leverages a novel emotion-embedding mapping network to fuse V-A values into textual features, enabling the capture of emotions in alignment with intended input prompts. A novel loss function is also proposed to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.
Poster
Longfei Huang · Yu Liang · Hao Zhang · Jinwei Chen · Wei Dong · Lunde Chen · Wanyu Liu · Bo Li · Peng-Tao Jiang

[ Exhibit Hall I ]

Abstract
Recent interactive matting methods have demonstrated satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of the pre-trained U-Net within diffusion models and transform the text-driven interaction mechanism into a visual prompt-driven interaction mechanism to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of objects into U-Net, enhancing SDMatte's sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism and a visual prompt-driven interaction mechanism that enable the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Code will be made publicly available.
Poster
Kumara Kahatapitiya · Haozhe Liu · Sen He · Ding Liu · Menglin Jia · Chenyang Zhang · Michael Ryoo · Tian Xie

[ Exhibit Hall I ]

Abstract
Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that 'not all videos are created equal': some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.
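A simplified sketch of an adaptive cache around a transformer block, reusing the cached output when the input has changed little since the last recompute; the change metric and threshold are assumptions, and the motion-regularization component is not modeled here.

import torch

class AdaptiveBlockCache:
    def __init__(self, threshold=0.05):
        self.threshold = threshold       # change tolerance below which we reuse the cache
        self.prev_input = None
        self.cached_output = None

    def __call__(self, block, x):
        if self.prev_input is not None and self.cached_output is not None:
            change = (x - self.prev_input).abs().mean() / (self.prev_input.abs().mean() + 1e-8)
            if change < self.threshold:  # little change since last recompute: reuse
                return self.cached_output
        out = block(x)                   # otherwise recompute and refresh the cache
        self.prev_input, self.cached_output = x.detach(), out.detach()
        return out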
Poster
Gaoyang Zhang · Bingtao Fu · Qingnan Fan · Qi Zhang · Runxing Liu · Hong Gu · Huaqi Zhang · Xinguo Liu

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS solves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, effectively compensating for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS by setting new state-of-the-art results with substantial relative gains across well-known benchmarks on spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%).
Poster
Guangben Lu · Yuzhen N/A · Zhimin Sun · Ran Yi · Yifan Qi · Yizhe Tang · Tianyi Wang · Lizhuang Ma · FangYuan Zou

[ Exhibit Hall I ]

Abstract
Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject's characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and spatial features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject's shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model's understanding …
Poster
Qingyan Bai · Hao Ouyang · Yinghao Xu · Qiuyu Wang · Ceyuan Yang · Ka Leong Cheng · Yujun Shen · Qifeng Chen

[ Exhibit Hall I ]

Abstract
As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.
Poster
YuanFu Yang · Hsiu-Hui Hsiao

[ Exhibit Hall I ]

Abstract
This paper presents the Implicit Knowledge Distillation Diffusion Transformer (IKDDiT), a groundbreaking model tailored for photolithography overlay map generation in semiconductor manufacturing. IKDDiT effectively addresses the challenges of open-vocabulary overlay map generation by integrating pre-trained image-text encoders, diffusion models, and masked transformers. Utilizing advanced text-to-image diffusion and image-text discriminative models, it generates high-fidelity overlay maps across multiple photolithography layers, significantly mitigating overlay misregistration errors and minimizing productivity losses caused by wafer rework. Key innovations include an implicit knowledge distillation framework that refines inter-image alignment by decoupling discriminative and generative tasks via an implicit discriminator, as well as a gated cross-attention mechanism to enhance generative performance. Experimental results demonstrate that IKDDiT achieves an optimal trade-off between efficiency and accuracy, providing a scalable, robust solution poised to advance overlay map generation in semiconductor processes.
Poster
Worameth Chinchuthakun · Tossaporn Saengja · Nontawat Tritrong · Pitchaporn Rewatbowornwong · Pramook Khungurn · Supasorn Suwajanakorn

[ Exhibit Hall I ]

Abstract
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64\% overall.
Poster
Maitreya Patel · Song Wen · Dimitris Metaxas · Yezhou Yang

[ Exhibit Hall I ]

Abstract
Despite recent advances in Rectified Flow Models (RFMs), unlocking their full potential for controlled generation tasks—such as inverse problems and image editing—remains a significant hurdle. Although RFMs and Diffusion Models (DMs) represent state-of-the-art approaches in generative modeling, their reliance on computationally demanding backpropagation through ODE solvers and inversion strategies often undermines efficiency and precision. In this paper, we present `FlowChef`, a novel training-, inversion-, and gradient-free inference-time steering strategy for RFMs that deterministically guides the denoising process. We first develop a theoretical and empirical understanding of the vector-field dynamics of RFMs in efficiently guiding the denoising trajectory. Specifically, leveraging the straightness and smooth Jacobian properties, we derive the mathematical relationship between gradients of rectified flow ODEs. We extend our theoretical findings to solve linear inverse problems, image editing, classifier guidance, and many more tasks. We perform extensive evaluations and show that `FlowChef` significantly exceeds baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Remarkably, for the first time, it scales effortlessly to billion-parameter models such as Flux. We release code and demos at: https://anonymous.4open.science/r/FlowChef/
Poster
Xiaohui Li · Yihao Liu · Shuo Cao · Chen Ziyan · SHAOBIN ZHUANG · Xiangyu Chen · Yinan He · Yi Wang · Yu Qiao

[ Exhibit Hall I ]

Abstract
Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos--precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.
Poster
Le Zhuo · Liangbing Zhao · Sayak Paul · Yue Liao · Renrui Zhang · Yi Xin · Peng Gao · Mohamed Elhoseiny · Hongsheng Li

[ Exhibit Hall I ]

Abstract
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance, and most notably, (3) reflection-level scaling, which explicitly models actionable reflections to iteratively assess and correct previously generated images. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 800K triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently fine-tune state-of-the-art diffusion transformers, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks. All code, checkpoint, and dataset will be released soon.
Poster
Teng Zhou · Xiaoyu Zhang · Yongchuan Tang

[ Exhibit Hall I ]

Abstract
Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall under the joint diffusion paradigm, but their complex and heuristic crop connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that utilizes token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50\%), fidelity (28.16\%), and aesthetics (15\%). Additionally, PanoLlama supports applications that other PIG methods cannot achieve, including mask-free layout control, multi-scale and multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research.
Poster
Fitim Abdullahu · Helmut Grabner

[ Exhibit Hall I ]

Abstract
Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. Large Multimodal Models (LMMs) trained on large-scale visual and textual data have demonstrated impressive capabilities. We explore the potential of these models to capture the concept of visual interestingness and examine, through comparative analysis, the alignment between human assessments and the predictions of GPT-4o, a leading LMM. Our studies reveal partial alignment between humans and GPT-4o, which already captures the concept better than state-of-the-art methods. Hence, this allows for the effective labeling of image pairs according to their (common) interestingness; these labeled pairs are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.
Poster
Eunseo Koh · SeungHoo Hong · Tae-Young Kim · Jae-Pil Heo · Simon Woo

[ Exhibit Hall I ]

Abstract
Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt the delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enabled more precise suppression in personalized T2I models by optimizing the delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative …
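A hedged sketch of constructing and applying a delta vector in the text-embedding space, assuming the delta is taken as the embedding difference between prompts with and without the unwanted concept; encode_text and the scale factor are illustrative stand-ins, not the paper's exact zero-shot construction.

def suppress_concept(encode_text, prompt, unwanted, scale=1.0):
    # encode_text: assumed stand-in for the diffusion model's text encoder
    e_prompt = encode_text(prompt)                            # e.g. "Charlie Chaplin"
    e_with = encode_text(f"{prompt}, {unwanted}")             # e.g. "..., mustache"
    delta = e_with - e_prompt                                 # direction of the unwanted concept
    return e_prompt - scale * delta                           # weakened conditioning embedding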
Poster
Junhyuk So · Juncheol Shin · Hyunho Kook · Eunhyeok Park

[ Exhibit Hall I ]

Abstract
Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7× while preserving image quality—all without requiring any additional training.
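A rough sketch of group-based acceptance for speculative decoding on image tokens, assuming the group is simply the smallest set of target tokens covering a fixed probability mass; the paper's dynamic grouping is more involved, so this only illustrates the acceptance idea.

import torch

def grouped_accept(draft_token, target_logits, coverage=0.9):
    # target_logits: (V,) logits of the target AR model at this position
    probs = torch.softmax(target_logits, dim=-1)
    sorted_p, sorted_idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(-1) <= coverage        # smallest set covering `coverage` mass
    keep[0] = True                                # always keep the top token
    valid_group = set(sorted_idx[keep].tolist())  # cluster of acceptable tokens
    return draft_token in valid_group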
Poster
Bhishma Dedhia · David Bourgin · Krishna Kumar Singh · Yuheng Li · Yan Kang · Zhan Xu · Niraj Jha · Yuchen Liu

[ Exhibit Hall I ]

Abstract
Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks, and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40\% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality …
Poster
Aniket Roy · Shubhankar Borse · Shreya Kadambi · Debasmit Das · Shweta Mahajan · Risheek Garrepalli · Hyojin Park · Ankita Nayak · Rama Chellappa · Munawar Hayat · Fatih Porikli

[ Exhibit Hall I ]

Abstract
We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertwined, not independent. To address this, we propose DuoLoRA—a content-style personalization framework featuring three key components: (1) rank-dimension mask learning, (2) effective merging via layer priors, and (3) Constyle loss, which leverages cycle-consistency in the merging process. First, we introduce ZipRank, which performs content-style merging within the rank dimension, offering adaptive rank flexibility and significantly reducing the number of learnable parameters. Additionally, we incorporate SDXL layer priors to apply implicit rank constraints informed by each layer’s content-style bias and adaptive merger initialization, enhancing the integration of content and style. To further refine the merging process, we introduce Constyle loss, which leverages the cycle consistency between content and style. Our experimental results demonstrate that DuoLoRA outperforms state-of-the-art content-style merging methods across multiple benchmarks.
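A minimal sketch of merging two LoRAs with a learnable gate over the shared rank dimension, as opposed to output-dimension masks; the shapes, sigmoid gating, and zero initialization are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class RankMaskedMerge(nn.Module):
    # content_A/style_A: (r, d_in) LoRA down-projections; content_B/style_B: (d_out, r) up-projections
    def __init__(self, content_A, content_B, style_A, style_B):
        super().__init__()
        r = content_A.shape[0]
        self.register_buffer("cA", content_A); self.register_buffer("cB", content_B)
        self.register_buffer("sA", style_A);   self.register_buffer("sB", style_B)
        self.mask = nn.Parameter(torch.zeros(r))          # one learnable gate per rank component

    def forward(self, x):
        m = torch.sigmoid(self.mask)                      # (r,) soft rank-dimension mask
        content = ((x @ self.cA.t()) * m) @ self.cB.t()   # content components kept where m -> 1
        style = ((x @ self.sA.t()) * (1 - m)) @ self.sB.t()
        return content + style                            # merged LoRA update added to the base layer output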
Poster
Dae-Young Song · Jung-Jae Yu · Donghyeon Cho

[ Exhibit Hall I ]

Abstract
Latent diffusion models have demonstrated superior performance over traditional methods in generating highly detailed and aesthetically pleasing images, which makes them widely used for various image generation and editing tasks, including outpainting. However, most LDM-based outpainting methods impose constraints on resolution and aspect ratio, often leading to the loss of local details and blurring. One way to address these issues is progressive outpainting, where the image is extended outward incrementally. However, naive progressive outpainting suffers from two key challenges: (1) difficulty in effectively capturing global context, making it hard to maintain the original context, and (2) a tendency to generate unnatural patterns. These challenges are particularly pronounced in art, where artists pre-design the composition before painting. As a result, existing methods often introduce visual inconsistencies that distract the viewer and diminish the intended artistic emphasis. To address these limitations, we propose two types of composition planning modules that enhance progressive outpainting by leveraging global structural guidance. These modules guide a pre-trained Stable Diffusion model to consider the overall composition, enabling realistic and contextually appropriate artwork completion without labor-intensive user prompts. Through experiments on diverse artwork images, we show the effectiveness of our proposed method both quantitatively and qualitatively.
Poster
Han Fang · Kejiang Chen · Zehua Ma · Jiajun Deng · Yicong Li · Weiming Zhang · Ee-Chien Chang

[ Exhibit Hall I ]

Abstract
Robustness is significant for generative image watermarking, typically achieved by injecting distortion-invariant watermark features. The leading paradigm, \emph{i.e.}, inversion-based framework, excels against non-geometric distortions but struggles with geometric ones. Due to the complexity of geometric distortions, finding universally geometric-invariant features is challenging, and it is not clear whether such invariant representation exists. To address this, we propose SynTag, a \textbf{syn}chronization \textbf{tag} injection-based method that enhances geometric robustness in inversion-based schemes. Instead of seeking invariant representations, we embed a sensitive template feature alongside the watermarking features. This template evolves with geometric distortions, allowing us to reconstruct the distortion trajectory for correction before extraction. Focusing on latent diffusion models, we fine-tune the VAE decoder to inject the invisible SynTag feature, pairing it with a prediction network for extraction and correction. Additionally, we introduce a dither compensation mechanism to further improve correction accuracy. SynTag is highly compatible with existing inversion-based methods. Extensive experiments demonstrate a significant boost in geometric distortion robustness while maintaining resilience against non-geometric distortions.
Poster
Hongjae Lee · Myungjun Son · Dongjea Kang · Seung-Won Jung

[ Exhibit Hall I ]

Abstract
Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
Poster
Guibao SHEN · Luozhou Wang · Jiantao Lin · Wenhang Ge · CHAOZHE ZHANG · Xin Tao · Di ZHANG · Pengfei Wan · Guangyong Chen · Yijun Li · Ying-Cong Chen

[ Exhibit Hall I ]

Abstract
Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset, MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.
Poster
Rongyao Fang · Chengqi Duan · Kun Wang · Hao Li · Linjiang Huang · Hao Tian · Xingyu Zeng · Rui Zhao · Jifeng Dai · Hongsheng Li · Xihui Liu

[ Exhibit Hall I ]

Abstract
Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models for visual content generation. However, existing approaches face a trade-off between generation diversity and controllability, struggling to meet the varying granularity demands of different image generation tasks within a unified MLLM framework. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. PUMA achieves this by unifying multi-granular visual features as both inputs and outputs of MLLMs, thus effectively meeting the distinct granularity needs for diverse generation and precise manipulation within a single framework. Following multimodal pretraining and instruction tuning, PUMA demonstrates remarkable capabilities in a wide range of multimodal tasks, including image understanding, diverse text-to-image generation, editing, inpainting, colorization, and conditional generation. This work marks a significant stride towards realizing truly unified MLLMs capable of seamlessly adapting to the diverse granularity demands and task requirements inherent in various visual tasks. The code and model will be released upon acceptance.
Poster
Sagi Polaczek · Yuval Alaluf · Elad Richardson · Yael Vinker · Daniel Cohen-Or

[ Exhibit Hall I ]

Abstract
Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advancements in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure — a core feature of vector graphics — as a secondary goal, diminishing their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally demonstrate that utilizing a neural representation provides an added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.
Poster
Khaled Abud · Sergey Lavrushkin · Alexey Kirillov · Dmitriy Vatolin

[ Exhibit Hall I ]

Abstract
Diffusion-based models have recently revolutionized image generation, achieving unprecedented levels of fidelity. However, consistent generation of high-quality images remains challenging partly due to the lack of conditioning mechanisms for perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. We show that diffusion models can learn complex qualitative relationships from both IQA models’ outputs and internal activations. First, we experiment with gradient-based guidance to optimize image quality directly and show this method has limited generalizability. To address this, we introduce IQA-Adapter, a novel framework that conditions generation on target quality levels by learning the implicit relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter can shift the distribution of generated images towards a higher-quality subdomain, and, inversely, it can be used as a degradation model, generating progressively more distorted images when provided with a lower-quality signal. Under high-quality condition, IQA-Adapter achieves up to a 10\% improvement across multiple objective metrics, as confirmed by a user preference study, while preserving generative diversity and content. Furthermore, we extend IQA-Adapter to a reference-based conditioning scenario, utilizing the rich activation space of IQA models to transfer highly specific, …
Poster
Jiajun Luo · Lizhuo Luo · Jianru Xu · Jiajun Song · Rongwei Lu · Chen Tang · Zhi Wang

[ Exhibit Hall I ]

Abstract
Mixture-of-Experts-based (MoE-based) diffusion models demonstrate remarkable scalability in high-fidelity image generation, yet their reliance on expert parallelism introduces critical communication bottlenecks. State-of-the-art methods alleviate such overhead in parallel diffusion inference through computation-communication overlapping, termed displaced parallelism. However, we identify that these techniques inherently induce severe *staleness*: the utilization of outdated activations from previous timesteps, which significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at the layer level and protects critical layers that are vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these strategies effectively reduce staleness, achieving a 1.26x speedup with minimal quality degradation. Empirical results establish DICE as an effective and scalable solution. Our code is publicly available at https://anonymous.4open.science/r/DICE-FF04
Poster
Jiwon Kim · Pureum Kim · SeonHwa Kim · Soobin Park · Eunju Cha · Kyong Hwan Jin

[ Exhibit Hall I ]

Abstract
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refine the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even for class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations.
Poster
Zelin Li · Ruohan Zong · Yifan Liu · Ruichen Yao · Yaokun Liu · Yang Zhang · Dong Wang

[ Exhibit Hall I ]

Abstract
With the advancement of personalized image generation technologies, concerns about forgery attacks that infringe on portrait rights and privacy are growing. To address these concerns, protection perturbation algorithms have been developed to disrupt forgery generation. However, the protection algorithms would become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, $\textbf{Anti-Tamper Perturbation (ATP)}$. ATP introduces a tamper-proofing mechanism within the perturbation. It consists of $\textit{protection}$ and $\textit{authorization}$ perturbations, where the protection perturbation defends against forgery attacks, while the authorization perturbation detects purification-based tampering. Both protection and authorization perturbations are applied in the frequency domain under the guidance of a mask, ensuring that the protection perturbation does not disrupt the authorization perturbation. This design also enables the authorization perturbation to be distributed across all image pixels, preserving its sensitivity to purification-based tampering. ATP demonstrates its effectiveness in defending against forgery attacks across various attack settings through extensive experiments, providing a robust solution for protecting individuals' portrait rights and privacy.
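As a rough illustration of keeping two perturbations from interfering, the sketch below places them in disjoint frequency bands of an image via an FFT mask; the radial-cutoff masking and the function names are assumptions for illustration, not the ATP design.

```python
# Hedged illustration (not the paper's implementation): embedding two
# perturbations in disjoint frequency bands of an image using an FFT mask,
# so that one band can carry "protection" and the other "authorization".
import numpy as np

def radial_mask(h, w, cutoff):
    """Boolean mask selecting frequencies below a normalized radial cutoff."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx**2 + fy**2) < cutoff

def embed_in_band(image, perturbation, band_mask):
    """Add a perturbation only inside the frequencies selected by band_mask."""
    spec = np.fft.fft2(image)
    pert_spec = np.fft.fft2(perturbation) * band_mask
    return np.real(np.fft.ifft2(spec + pert_spec))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
protect = 0.05 * rng.standard_normal((64, 64))
authorize = 0.05 * rng.standard_normal((64, 64))

low = radial_mask(64, 64, cutoff=0.15)   # low band: protection perturbation
high = ~low                              # high band: authorization tag
out = embed_in_band(img, protect, low)
out = embed_in_band(out, authorize, high)
print(np.abs(out - img).max())           # total perturbation stays small
```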
Poster
Yu-Chien Liao · Jr-Jen Chen · Chi-Pin Huang · Ci-Siang Lin · Meng-Lin Wu · Yu-Chiang Frank Wang

[ Exhibit Hall I ]

Abstract
Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of $\textbf{C}$oncept $\textbf{N}$euron $\textbf{S}$election, a simple yet effective approach to perform personalization in a continual learning scheme. $\textbf{CNS}$ uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, $\textbf{CNS}$ finetunes concept neurons in an incremental manner and jointly preserves knowledge learned from previous concepts. Evaluations on real-world datasets demonstrate that $\textbf{CNS}$ achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single- and multi-concept personalization. $\textbf{CNS}$ also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.
Poster
Yanbing Zhang · Zhe Wang · Qin Zhou · Mengping Yang

[ Exhibit Hall I ]

Abstract
In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments demonstrate that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving results that are state-of-the-art or comparable to those of approaches that require additional training. Notably, our framework …
Poster
Zhongyu Yang · Jun Chen · Dannong Xu · Junjie Fei · Xiaoqian Shen · Liangbing Zhao · Chun-Mei Feng · Mohamed Elhoseiny

[ Exhibit Hall I ]

Abstract
Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles whose topics are paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8\%-29\% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. We show some of our generated examples in \url{https://anonymous.4open.science/r/WikiAutoGen-C3C4}
Poster
Haoxuan Wang · Yuzhang Shang · Rui Xie · Junyi Wu · Junchi Yan · Yan Yan

[ Exhibit Hall I ]

Abstract
The practical deployment of diffusion models is still hindered by the high memory and computational overhead. Although quantization paves a way for model compression and acceleration, existing methods face challenges in achieving low-bit quantization efficiently. In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty, and propose to adjust these distributions through weight finetuning to be more quantization-friendly. We provide both theoretical and empirical evidence supporting finetuning as a practical and reliable solution. Building on this approach, we further distinguish two critical types of quantized layers: those responsible for retaining essential temporal information and those particularly sensitive to bit-width reduction. By selectively finetuning these layers under both local and global supervision, we mitigate performance degradation while enhancing quantization efficiency. Our method demonstrates its efficacy across three high-resolution image generation tasks, obtaining state-of-the-art performance across multiple bit-width settings.
Poster
Feihong Yan · qingyan wei · Jiayi Tang · Jiajun Li · Yulin Wang · Xuming Hu · Huiqi Li · Linfeng Zhang

[ Exhibit Hall I ]

Abstract
Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in adjacent decoding steps, which allows us to cache them in earlier steps and reuse them in later steps. Condition Redundancy indicates that the difference between conditional and unconditional outputs in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83× acceleration with almost no drop in generation quality. Our code has been released in the supplementary material and will be made publicly available on GitHub.
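The token-redundancy idea can be illustrated with a small caching helper that recomputes a block only for tokens whose representation changed noticeably between adjacent decoding steps; the similarity threshold and function name below are illustrative assumptions, not the LazyMAR implementation.

```python
# Minimal token-caching sketch inspired by the token-redundancy idea
# (hypothetical, not the LazyMAR code).
import torch

def cached_step(tokens, prev_tokens, prev_outputs, block, tau=1e-2):
    """Recompute a block only for tokens that changed noticeably.

    tokens, prev_tokens: (N, D) current / previous step inputs to the block.
    prev_outputs:        (N, D) block outputs cached at the previous step.
    """
    if prev_tokens is None:
        return block(tokens)
    # Cosine similarity between current and cached token representations.
    sim = torch.nn.functional.cosine_similarity(tokens, prev_tokens, dim=-1)
    changed = sim < (1.0 - tau)               # tokens that actually changed
    out = prev_outputs.clone()                # start from the cache
    if changed.any():
        out[changed] = block(tokens[changed])  # recompute only changed tokens
    return out

block = torch.nn.Linear(32, 32)
with torch.no_grad():                          # inference-style usage
    t0 = torch.randn(16, 32)
    y0 = block(t0)
    t1 = t0.clone(); t1[:3] += 0.5             # only three tokens change
    y1 = cached_step(t1, t0, y0, block)
print(torch.equal(y1[3:], y0[3:]))             # cached tokens are reused
```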
Poster
XIN Hu · Ke Qin · Guiduo Duan · Ming Li · Yuan-Fang Li · Tao He

[ Exhibit Hall I ]

Abstract
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model's inversion process, which preserves the spatial structure of input images, we propose the SPADE (SPatial-Aware Denoising-nEtwork) framework---a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network through a lightweight LoRA-based fine-tuning strategy, using cross-attention maps derived during inversion. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed-set and open-set scenarios, particularly excelling in spatial relationship prediction. The code is available at: https://anonymous.4open.science/r/SPADE-105F.
Poster
ZHANG YINGWEN · Meng Wang · Xihua Sheng · Peilin CHEN · Junru Li · Li Zhang · Shiqi Wang

[ Exhibit Hall I ]

Abstract
Lossy image compression networks aim to minimize the latent entropy of images while adhering to specific distortion constraints. However, optimizing the neural network can be challenging due to its nature of learning quantized latent representations. In this paper, our key finding is that minimizing the latent entropy is, to some extent, equivalent to maximizing the conditional source entropy, an insight that is deeply rooted in information-theoretic equalities. Building on this insight, we propose a novel structural regularization method for the neural image compression task by incorporating the negative conditional source entropy into the training objective, such that both the optimization efficacy and the model's generalization ability can be promoted. The proposed information-theoretic regularizer is interpretable, plug-and-play, and imposes no inference overheads. Extensive experiments demonstrate its superiority in regularizing the models and further squeezing bits from the latent representation across various compression structures and unseen domains.
Poster
Runze Zhang · Guoguang Du · Xiaochuan Li · Qi Jia · Liang Jin · Lu Liu · Jingjing Wang · Cong Xu · Zhenhua Guo · Yaqian Zhao · Xiaoli Gong · Rengang Li · Baoyu Fan

[ Exhibit Hall I ]

Abstract
Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or their basic combination, such as appending a description of a camera movement after a prompt without constraining the outcomes of this movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. Our research encompasses dataset construction through to the development of the model. Initially, we constructed a DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with an average caption of 206 words, detailing various camera movements and plot developments. Following this, we developed and trained the DropletVideo model, which excels in preserving spatio-temporal coherence during video generation. The DropletVideo …
Poster
Ce Wang · Zhenyu Hu · Wanjie Sun · Zhenzhong Chen

[ Exhibit Hall I ]

Abstract
Image rescaling aims to learn the optimal low-resolution (LR) image that can be accurately reconstructed to its original high-resolution (HR) counterpart, providing an efficient image processing and storage method for ultra-high definition media. However, extreme downscaling factors pose significant challenges to the upscaling process due to its highly ill-posed nature, causing existing image rescaling methods to struggle in generating semantically correct structures and perceptually friendly textures. In this work, we propose a novel framework called Timestep-Aware Diffusion Model (TADM) for extreme image rescaling, which performs rescaling operations in the latent space of a pre-trained autoencoder and effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model. Specifically, TADM adopts a pseudo-invertible module to establish the bidirectional mapping between the latent features of the HR image and the target-sized LR image. Then, the rescaled latent features are enhanced by a pre-trained diffusion model to generate more faithful details. Considering the spatially non-uniform degradation caused by the rescaling operation, we propose a novel timestep alignment strategy, which can adaptively allocate the generative capacity of the diffusion model based on the quality of the reconstructed latent features. Extensive experiments demonstrate the superiority of TADM over previous methods in both quantitative …
Poster
Aoxiong Yin · Kai Shen · Yichong Leng · Xu Tan · Xinyu Zhou · Juncheng Li · Siliang Tang

[ Exhibit Hall I ]

Abstract
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://anonymoust2v.github.io/ .
Poster
Zhen Zhang · Zhen Zhang · Qianlong Dang · Zhize Wu · LiChuan Gu

[ Exhibit Hall I ]

Abstract
Single domain generalization aims to learn a model with good generalization ability from a single source domain. Recent advances in this field have focused on increasing the diversity of the training data through style (e.g., color and texture) augmentation. However, most existing methods apply uniform perturbations to the entire image, failing to simulate complex images with multiple distinct stylistic regions. To address this, we propose a ``Split-And-Combine" (SAC) strategy to enhance style diversity. Specifically, SAC first performs patch-aware augmentation, which splits an image into multiple patches and applies style augmentation independently to each patch, enabling distinct color variations across regions. Then, SAC combines these patches to reconstruct a complete image and applies adaptive random convolutions, which utilize a deformable convolution layer with random and Gaussian filters to enhance texture diversity while preserving object integrity. Notably, SAC leverages entropy as a risk assessment criterion to adaptively determine whether a sample should undergo augmentation within the iterative process of random convolutions, preventing excessive augmentation. Furthermore, SAC introduces an energy-based distribution discrepancy score to quantify out-of-distribution likelihood, systematically expanding the augmented data's distribution. SAC can serve as a plug-and-play component to improve the performance of recent methods. Extensive experiments on four datasets demonstrate …
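A minimal sketch of the split-and-combine idea follows, assuming a simple per-patch color jitter as the "style" augmentation; the actual SAC module additionally applies adaptive random convolutions and entropy-based gating, which are omitted here.

```python
# Rough illustration of patch-aware style augmentation: split an image into
# patches, jitter the color statistics of each patch independently, then
# stitch the patches back together. Illustrative, not the authors' SAC module.
import torch

def patch_style_jitter(img, patch=8, strength=0.2, generator=None):
    """img: (C, H, W) tensor in [0, 1]; returns an augmented copy."""
    c, h, w = img.shape
    out = img.clone()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            region = out[:, y:y + patch, x:x + patch]
            # Per-patch, per-channel affine color perturbation.
            gain = 1.0 + strength * torch.randn(c, 1, 1, generator=generator)
            bias = strength * torch.randn(c, 1, 1, generator=generator)
            out[:, y:y + patch, x:x + patch] = (region * gain + bias).clamp(0, 1)
    return out

g = torch.Generator().manual_seed(0)
image = torch.rand(3, 32, 32, generator=g)
augmented = patch_style_jitter(image, patch=8, strength=0.3, generator=g)
print(augmented.shape, float((augmented - image).abs().mean()))
```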
Poster
Wonwoong Cho · Yan-Ying Chen · Matthew Klenk · David I. Inouye · Yanxia Zhang

[ Exhibit Hall I ]

Abstract
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the **Attribute (Att) Adapter**, a novel plug-and-play module designed to enable fine-grained, multi-attribute control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross-attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce a Conditional Variational Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
Poster
Jiale Cheng · Ruiliang Lyu · Xiaotao Gu · Xiao Liu · Jiazheng Xu · Yida Lu · Jiayan Teng · Zhuoyi Yang · Yuxiao Dong · Jie Tang · Hongning Wang · Minlie Huang

[ Exhibit Hall I ]

Abstract
Video generation models have made remarkable progress in recent years, demonstrating outstanding performance in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to traditional prompt optimization methods, such …
Poster
Bimsara Pathiraja · Maitreya Patel · Shivam Singh · Yezhou Yang · Chitta Baral

[ Exhibit Hall I ]

Abstract
Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce **`RefEdit-Bench`**, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce **`RefEdit`** -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our **`RefEdit`**, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We will release our code, data, and checkpoints.
Poster
Shufan Li · Konstantinos Kallidromitis · Akash Gokul · Arsh Koneru · Yusuke Kato · Kazuki Kozuka · Aditya Grover

[ Exhibit Hall I ]

Abstract
The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using …
Poster
Wen Qian

[ Exhibit Hall I ]

Abstract
Diffusion techniques have significantly advanced the development of virtual try-on. However, these methods often struggle to preserve intricate details, such as patterns, text, and faces. To tackle this challenge, we introduce a plug-and-play module named "TryOn-Refiner", which can refine the detailed artifacts of any try-on result in only $1\sim10$ steps. Instead of previous diffusion-based refinement modules, TryOn-Refiner employs a conditional rectified-flow-based mechanism to better leverage prior information from coarse try-on results. Specifically, TryOn-Refiner transforms the traditional refinement framework from a noise-to-image paradigm into a flow-mapping framework that directly maps coarse images to refined images, essentially avoiding introducing uncertainty into the refinement process. Moreover, we propose a training data construction pipeline, which can efficiently generate paired training data and includes a data smoothing strategy to overcome the blocking artifact. Extensive experimental results demonstrate that our TryOn-Refiner consistently improves performance with only a few inference steps for all evaluated existing try-on methods.
Poster
Jianhong Bai · Menghan Xia · Xiao Fu · Xintao Wang · Lianrui Mu · Jinwen Cao · Zuozhu Liu · Haoji Hu · Xiang Bai · Pengfei Wan · Di ZHANG

[ Exhibit Hall I ]

Abstract
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through an elegant yet powerful video conditioning mechanism—an aspect often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.
Poster
Xianglong He · Zi-Xin Zou · Chia Hao Chen · Yuan-Chen Guo · Ding Liang · Chun Yuan · Wanli Ouyang · Yanpei Cao · Yangguang Li

[ Exhibit Hall I ]

Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation …
Poster
Aniket Rege · Zinnia Nie · Unmesh Raskar · Mahesh Ramesh · Zhuoran Yu · Aditya Kusupati · Yong Jae Lee · Ramya Vinayak

[ Exhibit Hall I ]

Abstract
Popular text-to-image (T2I) models are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to text-to-image systems as a proxy for human judgments. Our CuRe dataset has a novel categorical hierarchy that enables benchmarking T2I systems in this manner, with 32 cultural subcategories across six broad cultural axes (food, art, fashion, architecture, celebrations, and people), built from the crowdsourced Wikimedia knowledge graph. Unlike flawed existing benchmarks, which suffer from ``generative entanglement'' due to overlapping training and evaluation data, CuRe enables fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP2, AIMv2 and DINOv2), image-text models (CLIP, SigLIP) and state-of-the-art text-to-image systems including Stable Diffusion 3.5 Large and Flux.1. The code and benchmark dataset are available at: \textbf{hidden for double blind}
Poster
Yu Cheng · Fajie Yuan

[ Exhibit Hall I ]

Abstract
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose \textbf{LeanVAE}, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE’s superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50× fewer FLOPs and 44× faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code will be made publicly available.
Poster
Felix Krause · Timy Phan · Ming Gui · Stefan A. Baumann · Vincent Tao Hu · Björn Ommer

[ Exhibit Hall I ]

Abstract
Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place still remains very costly. While several recent approaches—including masking, distillation, and architectural modifications—have been proposed to improve training efficiency, each of these methods comes with its own tradeoffs: some achieve enhanced performance at the expense of increased computational cost. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to common transformer-based models; it can also be applied to state-space models and achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-256 in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT …
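A toy sketch of such a token route is given below: a random subset of tokens skips a span of intermediate blocks and is re-inserted at a deeper layer. The backbone, routing span, and re-insertion rule are illustrative assumptions, not the TREAD implementation.

```python
# Hypothetical token-routing sketch: some tokens bypass the intermediate
# blocks (saving their compute) and rejoin the sequence at a deeper layer.
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    def __init__(self, dim=64, depth=6, route_ratio=0.5, start=1, end=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth))
        self.route_ratio, self.start, self.end = route_ratio, start, end

    def forward(self, tokens):                     # tokens: (B, N, D)
        b, n, d = tokens.shape
        n_route = int(self.route_ratio * n)
        perm = torch.randperm(n)
        route_idx, keep_idx = perm[:n_route], perm[n_route:]
        routed = None
        for i, blk in enumerate(self.blocks):
            if i == self.start:                    # route out: stash the tokens
                routed = tokens[:, route_idx]
                tokens = tokens[:, keep_idx]
            if i == self.end and routed is not None:   # route in: re-insert them
                merged = torch.zeros(b, n, d, dtype=tokens.dtype)
                merged[:, keep_idx] = tokens
                merged[:, route_idx] = routed
                tokens, routed = merged, None
            tokens = tokens + blk(tokens)          # residual block
        return tokens

model = ToyBackbone()
out = model(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```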
Poster
Zeyi Sun · Tong Wu · Pan Zhang · Yuhang Zang · Xiaoyi Dong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang

[ Exhibit Hall I ]

Abstract
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates filtered multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering data and rewriting inaccurate captions. Leveraging this pipeline, we have generated large scale synthetic multi-view images with dense descriptive captions. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and view consistency.
Poster
Shiyu Qin · Jinpeng Wang · Yimin Zhou · Bin Chen · Tianci Luo · Baoyi An · Tao Dai · Shu-Tao Xia · Yaowei Wang

[ Exhibit Hall I ]

Abstract
Learned image compression (LIC) demonstrates superior rate-distortion (RD) performance compared to traditional methods. The recent method MambaVC introduces Mamba, a variant of state space models, into this field, aiming to establish a new paradigm beyond convolutional neural networks and transformers. However, this approach relies on predefined four-directional scanning, which prioritizes spatial proximity over content and semantic relationships, resulting in suboptimal redundancy elimination. Additionally, it focuses solely on nonlinear transformations, neglecting entropy model improvements crucial for accurate probability estimation in entropy coding. To address these limitations, we propose Cassic, a novel framework based on a content-adaptive visual state space model, through two innovations. First, we design a content-adaptive selective scan based on weighted activation maps and bit allocation maps, subsequently developing a content-adaptive visual state space block. Second, we present a Mamba-based channel-wise auto-regressive entropy model to fully leverage inter-slice bit allocation consistency for enhanced probability estimation. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across three datasets while maintaining faster processing speeds than the existing MambaVC approach.
Poster
Xuan Ju · Weicai Ye · Quande Liu · Qiulin Wang · Xintao Wang · Pengfei Wan · Di ZHANG · Kun Gai · Qiang Xu

[ Exhibit Hall I ]

Abstract
Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they suffer from three key limitations: **branch conflicts** between independently trained adapters, **parameter redundancy** leading to increased computational cost, and **suboptimal performance** compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions—including text, camera, identities, and depth—via full-attention mechanisms. By directly fusing multimodal conditions into a unified sequence representation, FullDiT significantly reduces parameter overhead, avoids conflicts common in adapter-based methods, and shows scalability and emergent ability. We further introduce FullBench, a new benchmark designed specifically for multi-condition video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of unified full-attention in complex multimodal video tasks.
Poster
Rishubh Parihar · Sachidanand VS · Venkatesh Babu Radhakrishnan

[ Exhibit Hall I ]

Abstract
Diffusion models have transformed image editing but struggle with precise depth-aware control, such as placing objects at a specified depth. Layered representations offer fine-grained control by decomposing an image into separate editable layers. However, existing methods simplistically represent a scene via a set of background and transparent foreground layers while ignoring the scene geometry - limiting their effectiveness for depth-aware editing. We propose \textbf{D}epth-\textbf{G}uided \textbf{L}ayer \textbf{D}ecomposition - a layering method that decomposes an image into foreground and background layers based on a \textbf{user-specified depth value}, enabling precise depth-aware edits. We further propose \textbf{F}eature \textbf{G}uided \textbf{L}ayer \textbf{C}ompositing - a zero-shot approach for realistic layer compositing by leveraging generative priors from pretrained diffusion models. Specifically, we guide the internal U-Net features to progressively fuse individual layers into a composite latent at each denoising step. This preserves the structure of individual layers while generating realistic outputs with appropriate color and lighting adjustments without a need for post-hoc harmonization models. We demonstrate our method on two key depth-aware editing tasks: \textbf{1)} scene compositing by blending the foreground of one scene with the background of another at a specified depth, and; \textbf{2)} object insertion at a user-defined depth. Our zero-shot approach achieves precise depth ordering …
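The depth-guided decomposition step can be illustrated with a few lines that split an image into foreground and background RGBA layers at a user-specified depth value; the soft threshold below is an assumption for illustration and omits the paper's diffusion-based feature-guided compositing.

```python
# Illustrative sketch (not the paper's pipeline): split an RGB image into
# foreground/background layers at a user-specified depth threshold, producing
# an RGBA layer pair that could then be edited and re-composited.
import numpy as np

def decompose_by_depth(rgb, depth, depth_value, soft=0.05):
    """rgb: (H, W, 3) in [0, 1]; depth: (H, W) with smaller = closer.

    Returns (foreground_rgba, background_rgba) split at `depth_value`,
    with a soft transition of width `soft` to avoid hard seams.
    """
    alpha_fg = np.clip((depth_value - depth) / soft + 0.5, 0.0, 1.0)
    fg = np.dstack([rgb, alpha_fg])
    bg = np.dstack([rgb, 1.0 - alpha_fg])
    return fg, bg

def composite(fg, bg):
    """Standard alpha-over compositing of the two layers."""
    a = fg[..., 3:4]
    return fg[..., :3] * a + bg[..., :3] * (1.0 - a)

rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))
depth = rng.random((64, 64))
fg, bg = decompose_by_depth(rgb, depth, depth_value=0.5)
print(np.allclose(composite(fg, bg), rgb))   # layers re-composite to the input
```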
Poster
Jaeseok Jeong · Junho Kim · Youngjung Uh · Gayoung Lee · Yunjey Choi

[ Exhibit Hall I ]

Abstract
In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and propose 2) negative visual query guidance (NVQG) to reduce the transfer of unwanted content. NVQG employs a negative score by intentionally simulating content-leakage scenarios that swap the queries, instead of the keys and values, of self-attention layers with those from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompt. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that the resulting images match the text prompts.
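A hedged sketch of the query-swap idea follows: self-attention is run with queries taken from the style features while keys and values come from the content features, deliberately simulating content leakage so the result can serve as a negative guidance branch. The shapes and the guidance formula are illustrative assumptions.

```python
# Illustrative query-swapped attention and a CFG-style negative guidance
# combination (coefficients and shapes are assumptions, not the paper's code).
import torch

def attention(q, k, v):
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

def query_swapped_attention(content_feats, style_feats):
    """Self-attention whose queries are swapped in from the style branch."""
    q = style_feats                       # <- swapped queries (leakage proxy)
    k, v = content_feats, content_feats
    return attention(q, k, v)

def guided_output(cond, uncond, leak, w=7.5, w_neg=1.0):
    """Toy guidance: push toward the condition, away from the leakage branch."""
    return uncond + w * (cond - uncond) - w_neg * (leak - uncond)

c = torch.randn(1, 16, 64)   # content tokens
s = torch.randn(1, 16, 64)   # style tokens
leak = query_swapped_attention(c, s)
print(leak.shape)            # torch.Size([1, 16, 64])
```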
Poster
Srikumar Sastry · Aayush Dhakal · Eric Xing · Subash Khanal · Nathan Jacobs

[ Exhibit Hall I ]

Abstract
Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models will be open-sourced.
Poster
Sucheng Ren · Qihang Yu · Ju He · Xiaohui Shen · Alan Yuille · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, …
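To illustrate continuous entity regression with flow matching, the sketch below runs a training step that regresses the velocity of a linear noise-to-entity path for a flattened cell of patches; the tiny MLP and the linear path are assumptions for illustration, not the xAR architecture.

```python
# Minimal flow-matching training step on a flattened "cell" entity
# (illustrative only; the real model is a conditioned transformer).
import torch
import torch.nn as nn

dim = 48                                  # e.g., a small cell of RGB patches, flattened
model = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def flow_matching_step(x1):
    """x1: (B, dim) clean entities. Learn the velocity of a linear path."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the path
    target_v = x1 - x0                             # constant velocity target
    pred_v = model(torch.cat([xt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

clean_cells = torch.randn(32, dim)                 # stand-in data
for _ in range(3):
    print(flow_matching_step(clean_cells))
```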
Poster
Zijun Zhou · Yingying Deng · Xiangyu He · Weiming Dong · Fan Tang

[ Exhibit Hall I ]

Abstract
Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, which often struggles with ambiguous user intent, complex transformations, or the need for progressive refinements. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.
Poster
Xinbo Wang · Wenju Xu · Qing Zhang · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transformation to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transformation, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model will be made publicly available.
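The band-blending step can be sketched as keeping the low-frequency band of the warped reference and the high-frequency band of the input; the blur-based band split below stands in for the paper's AdaIN-Wavelet transform, which operates in latent space.

```python
# Rough illustration of blending the low-frequency band of a (warped) style
# reference with the high-frequency band of the content input. A simple
# pool-and-upsample low-pass stands in for the wavelet decomposition.
import torch
import torch.nn.functional as F

def low_pass(x, k=4):
    """Cheap low-pass: average-pool then upsample back to the input size."""
    pooled = F.avg_pool2d(x, kernel_size=k)
    return F.interpolate(pooled, size=x.shape[-2:], mode="bilinear",
                         align_corners=False)

def band_blend(content, warped_reference, k=4):
    """Keep the reference's coarse structure/color and the content's fine detail."""
    low_ref = low_pass(warped_reference, k)
    high_content = content - low_pass(content, k)
    return low_ref + high_content

content = torch.rand(1, 3, 64, 64)
reference = torch.rand(1, 3, 64, 64)
stylized = band_blend(content, reference)
print(stylized.shape)  # torch.Size([1, 3, 64, 64])
```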
Poster
Zhu Xu · Ting Lei · Zhimin Li · Guan Wang · Qingchao Chen · Yuxin Peng · Yang Liu

[ Exhibit Hall I ]

Abstract
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced In-domain Knowledge Transferring (TIKT) method, which leverages in-domain knowledge to enhance detection in relation-aware dynamic scenarios. TIKT is built on two key components: (1) In-domain knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas, making the attention maps relation-aware. Then we propose an Inter-frame Attention Augmentation strategy that exploits neighboring frames and optical flow information to enhance these attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware in-domain knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention …
Poster
Yuran Dong · Mang Ye

[ Exhibit Hall I ]

Abstract
To advance real-world fashion image editing, we analyze existing two-stage pipelines—mask generation followed by diffusion-based editing—which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like the upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like the waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence→Stabilization→Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.
Poster
Baoyue Hu · Yang Wei · Junhao Xiao · Wendong Huang · Xiuli Bi · Bin Xiao

[ Exhibit Hall I ]

Abstract
To defend against personalized generation, a new form of infringement that is more concealed and destructive, existing copyright protection methods add adversarial perturbations to images. However, these methods focus solely on countering illegal personalization, neglecting the requirement for legitimate personalization. Moreover, none of these methods can directly verify and trace the copyright from adversarial examples. In response to these limitations, we propose a traceable and authorizable copyright protection method that embeds the copyright watermark into images through a series of invertible compound coupling modules. We introduce a novel information exchange mechanism for invertible neural networks and design a contrastive learning-based optimization strategy tailored to address personalized infringement issues. Our method effectively mitigates the malicious use of unauthorized personalized generation models by inducing watermark-like artifacts and obscuring privacy details in generated images. Additionally, it facilitates copyright traceability and supports authorized legitimate personalization, thereby offering broader practical applicability. Experimental results demonstrate that our method can almost losslessly restore the original image and extract the copyright watermark, while achieving FID scores exceeding 300 and causing visually noticeable artifacts in unauthorized personalized images. Furthermore, it exhibits consistent robustness against adversarial purification and text prompt modifications.
Poster
Yuanhui Huang · Weiliang Chen · Wenzhao Zheng · Yueqi Duan · Jie Zhou · Jiwen Lu

[ Exhibit Hall I ]

Abstract
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters.
Poster
Jiacheng Liu · Chang Zou · Yuanhuiyi Lyu · Junjie Chen · Linfeng Zhang

[ Exhibit Hall I ]

Abstract
Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching has been proposed to accelerate diffusion models by caching the features at previous timesteps and reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves $3.41$ lower FID compared with the previous SOTA at $4.53\times$ acceleration. Our code is provided in the supplementary materials and will be made publicly available on GitHub.
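To make the forecasting idea concrete, here is a hedged toy sketch of predicting a future feature with a Taylor expansion whose derivatives are approximated by finite differences over cached features. It is my own reconstruction from the abstract, not the released TaylorSeer code; function and variable names are hypothetical, and uniformly spaced cached timesteps are assumed.

```python
# Toy sketch: Taylor-style feature forecasting from cached features (assumption-based).
import math
import torch

def taylor_forecast(cached_feats, cached_ts, t_future, order=2):
    """cached_feats: list of tensors at increasing times cached_ts; returns the predicted feature at t_future."""
    h = cached_ts[-1] - cached_ts[-2]                 # assumes uniform spacing between cached timesteps
    derivs = [cached_feats[-1]]                       # 0th derivative = latest cached feature
    diffs = list(cached_feats)
    for _ in range(order):
        diffs = [(diffs[i + 1] - diffs[i]) / h for i in range(len(diffs) - 1)]
        derivs.append(diffs[-1])                      # backward finite-difference estimate of the k-th derivative
    dt = t_future - cached_ts[-1]
    return sum(d * dt ** k / math.factorial(k) for k, d in enumerate(derivs))

feats = [torch.full((4, 256), float(i)) for i in range(3)]     # toy features at t = 0, 1, 2
pred = taylor_forecast(feats, [0.0, 1.0, 2.0], t_future=3.0, order=2)
print(pred[0, 0])                                              # tensor(3.) -- linear trend extrapolated
```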
Poster
Jiahao Zhu · Zixuan Chen · Guangcong Wang · Xiaohua Xie · Yi Zhou

[ Exhibit Hall I ]

Abstract
Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present \textbf{SegmentDreamer}, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, \textbf{SCTD} partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on the distillation error. Additionally, we propose a distillation pipeline for swifter and more stable generation. Extensive experiments demonstrate that our \textbf{SegmentDreamer} outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).
Poster
Ho Kei Cheng · Alex Schwing

[ Exhibit Hall I ]

Abstract
Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference, as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C$^2$OT), which adds a conditional weighting term to the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions on 8gaussians$\to$moons, CIFAR-10, ImageNet-32$\times$32, and ImageNet-256$\times$256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code will be made available.
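The following is a hedged sketch of what a condition-weighted minibatch OT cost could look like for discrete conditions, based only on my reading of the abstract (not the authors' implementation; handling of continuous conditions would differ). It uses the Hungarian solver from SciPy for the exact minibatch assignment, and the toy setup attaching a condition to each prior sample is an assumption.

```python
# Toy sketch: condition-weighted minibatch optimal transport pairing (assumption-based).
import numpy as np
from scipy.optimize import linear_sum_assignment

def conditional_ot_pairing(x0, x1, cond0, cond1, weight=1e3):
    """x0: prior samples (n, d); x1: data samples (n, d); cond0/cond1: discrete condition labels (n,)."""
    sq_dist = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)        # standard transport cost
    mismatch = (cond0[:, None] != cond1[None, :]).astype(np.float64)  # 1 where conditions disagree
    rows, cols = linear_sum_assignment(sq_dist + weight * mismatch)   # condition-weighted assignment
    return rows, cols                                                 # pair x0[rows[i]] with x1[cols[i]]

x0 = np.random.randn(8, 2)             # noise/prior minibatch
x1 = np.random.randn(8, 2)             # data minibatch
c0 = np.random.randint(0, 2, size=8)   # hypothetical condition attached to each prior sample
c1 = np.random.randint(0, 2, size=8)   # condition of each data sample
print(conditional_ot_pairing(x0, x1, c0, c1))
```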
Poster
Fei Peng · Junqiang Wu · Yan Li · Tingting Gao · Di ZHANG · Huiyuan Fu

[ Exhibit Hall I ]

Abstract
Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and models will be made publicly available.
Poster
Wenjie Xuan · Jing Zhang · Juhua Liu · Bo Du · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Recent works have favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raise new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable text-to-image generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be released.
Poster
Sunghyun Park · Seokeon Choi · Hyoungwoo Park · Sungrack Yun

[ Exhibit Hall I ]

Abstract
Personalizing text-to-image diffusion models is crucial for adapting pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in the weak model through weight interpolation between the pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which rely solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
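A hedged toy sketch of the two ingredients the abstract describes, a "weak" model obtained by interpolating pre-trained and fine-tuned weights and guidance anchored on its null-prompt prediction, is shown below. This is my own reconstruction, not the authors' method; the module, argument names, and the tiny stand-in noise predictor are all hypothetical.

```python
# Toy sketch: weight interpolation plus weak-model guidance (assumption-based).
import copy
import torch

@torch.no_grad()
def interpolate_weights(pretrained, finetuned, alpha):
    """Return a copy whose parameters are (1 - alpha) * pretrained + alpha * finetuned."""
    weak = copy.deepcopy(finetuned)
    sd_p, sd_f = pretrained.state_dict(), finetuned.state_dict()
    weak.load_state_dict({k: (1 - alpha) * sd_p[k] + alpha * sd_f[k] for k in sd_f})
    return weak

def personalization_guidance(finetuned, weak, x_t, t, text_emb, null_emb, scale=5.0):
    eps_strong = finetuned(x_t, t, text_emb)   # fine-tuned model, target-concept prompt
    eps_weak = weak(x_t, t, null_emb)          # interpolated weak model, null prompt
    return eps_weak + scale * (eps_strong - eps_weak)

class TinyEps(torch.nn.Module):                # toy noise predictor standing in for a diffusion U-Net
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)
    def forward(self, x_t, t, cond):
        return self.proj(x_t + cond)

pre, fin = TinyEps(), TinyEps()
weak = interpolate_weights(pre, fin, alpha=0.5)
out = personalization_guidance(fin, weak, torch.randn(2, 8), t=0,
                               text_emb=torch.randn(2, 8), null_emb=torch.zeros(2, 8))
print(out.shape)                               # torch.Size([2, 8])
```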
Poster
Roi Benita · Michael Finkelson · Tavi Halperin · Gleb Sterkin · Yossi Adi

[ Exhibit Hall I ]

Abstract
Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video-to-audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While supported by existing methods, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, besides its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.
Poster
Ju-Hyeon Nam · Dong-Hyun Moon · Sang-Chul Lee

[ Exhibit Hall I ]

Abstract
Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map—a curvature metric indicating the difficulty of forgery localization—which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.
Poster
Jimyeong Kim · Jungwon Park · Yeji Song · Nojun Kwak · Wonjong Rhee

[ Exhibit Hall I ]

Abstract
Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.
Poster
YINWEI WU · Xianpan Zhou · bing ma · Xuefeng Su · Kai Ma · Xinchao Wang

[ Exhibit Hall I ]

Abstract
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the feature generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. To address this Instance Feature Generation (IFG) task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models’ abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
Poster
Qian Wang · Aleksandar Cvejic · Abdelrahman Eldesokey · Peter Wonka

[ Exhibit Hall I ]

Abstract
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
Poster
Nataniel Ruiz · Yuanzhen Li · Neal Wadhwa · Yael Pritch · Michael Rubinstein · David Jacobs · Shlomi Fruchter

[ Exhibit Hall I ]

Abstract
We present Magic Insert, a method to drag-and-drop subjects from a user-provided image into a target image of a different style in a plausible manner while matching the style of the target image. This work formalizes our version of the problem of style-aware drag-and-drop and proposes to tackle it by decomposing it into two sub-problems: style-aware personalization and realistic object insertion in stylized images. For style-aware personalization, we cast our method as a weight-and-text-embedding finetuning method with inference-time module-targeted style injection. For subject insertion, we propose Bootstrapped Domain Adaption (BDA) to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method significantly outperforms traditional and state-of-the-art approaches that struggle with quality, subject fidelity and harmonious stylization. Finally, we present a new dataset, SubjectPlop, to facilitate evaluation and future progress in this area.
Poster
yifei xia · Suhan Ling · Fangcheng Fu · Yujie Wang · Huixia Li · Xuefeng Xiao · Bin CUI

[ Exhibit Hall I ]

Abstract
Generating high-quality long videos with Diffusion Transformers (DiTs) faces significant latency due to computationally intensive attention mechanisms. For instance, generating an 8s 720p video (110K tokens) with HunyuanVideo requires around 600 PFLOPs, with attention computations consuming about 500 PFLOPs. To tackle this, we propose **AdaSpa**, the first **Dynamic Pattern** and **Online Precise Search** sparse attention method for DiTs. First, AdaSpa uses a blockified pattern to efficiently represent the hierarchical sparsity inherent in DiTs, significantly reducing attention complexity while preserving video fidelity. This is motivated by our observation that DiTs' sparsity exhibits hierarchical and blockified structures across modalities. Second, AdaSpa introduces Fused LSE-Cached Search with Head-Adaptive Block Sparse Attention for efficient online precise search and computation. This approach leverages the invariance of sparse patterns and LSE across denoising steps, allowing precise real-time identification of sparse patterns with minimal overhead. AdaSpa is an **adaptive, plug-and-play solution** that seamlessly integrates into existing DiT models without additional training or data profiling. Extensive experiments validate that AdaSpa significantly accelerates video generation from 1.59$\times$ to 2.04$\times$ while maintaining video quality, demonstrating strong effectiveness.
Poster
Rohit Gandikota · Zongze Wu · Richard Zhang · David Bau · Eli Shechtman · Nicholas Kolkin

[ Exhibit Hall I ]

Abstract
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines.
Poster
Yi Huang · Wei Xiong · He Zhang · Chaoqi Chen · Jianzhuang Liu · Mingfu Yan · Shifeng Chen

[ Exhibit Hall I ]

Abstract
Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment still remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject’s identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing.
Poster
Bin Fu · Zixuan Wang · Kainan Yan · Shitian Zhao · Qi Qin · Jie Wen · Junjun He · Peng Gao

[ Exhibit Hall I ]

Abstract
Few-shot font generation (FFG) aims to create new font images by imitating the style from a limited set of reference images, while maintaining the content from the source images. Although this task has achieved significant progress, most existing methods still suffer from the incorrect generation of complicated character structure and detailed font style. To address the above issues, in this paper, we regard font generation as a font transfer process from the source font to the target font, and construct a video generation framework to model this process. Moreover, a test-time condition alignment mechanism is further developed to enhance the consistency between the generated samples and the provided condition samples. Specifically, we first construct a diffusion-based image-to-image font generation framework for the few-shot font generation task. This framework is expanded into an image-to-video font generation framework by integrating temporal components and frame-index information, enabling the production of high-quality font videos that transition from the source font to the target font. Based on this framework, we develop a noise inversion mechanism in the generative process to perform content and style alignment between the generated samples and the provided condition samples, enhancing style consistency and structural accuracy. The experimental results show that …
Poster
Jeongho Kim · Hoiyeong Jin · Sunghyun Park · Jaegul Choo

[ Exhibit Hall I ]

Abstract
Recent virtual try-on approaches have advanced by fine-tuning pre-trained text-to-image diffusion models to leverage their powerful generative ability; however, the use of text prompts in virtual try-on remains underexplored. This paper tackles a text-editable virtual try-on task that modifies the clothing based on the provided clothing image while editing the wearing style (e.g., tucking style, fit) according to the text descriptions. In the text-editable virtual try-on, three key aspects exist: (i) designing rich text descriptions for paired person-clothing data to train the model, (ii) addressing the conflicts where textual information of the existing person's clothing interferes with the generation of the new clothing, and (iii) adaptively adjusting the inpainting mask aligned with the text descriptions, ensuring proper editing areas while preserving the original person's appearance irrelevant to the new clothing. To address these aspects, we propose PromptDresser, a text-editable virtual try-on model that leverages large multimodal model (LMM) assistance to enable high-quality and versatile manipulation based on generative text prompts. Our approach utilizes LMMs via in-context learning to generate detailed text descriptions for person and clothing images independently, including pose details and editing attributes, using minimal human cost. Moreover, to ensure the editing areas, we adjust the inpainting mask depending on …
Poster
Fengyuan Shi · Zhuoyan Luo · Yixiao Ge · Yujiu Yang · Ying Shan · Limin Wang

[ Exhibit Hall I ]

Abstract
Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on reconstruction and the application of autoregressive visual generation.
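The straight-through one-hot idea can be illustrated with the following hedged toy sketch, reconstructed from the abstract rather than the official IBQ code: the forward pass uses the hard one-hot assignment while the backward pass flows through a soft distribution over the full codebook, so every codebook embedding stays in the computation graph.

```python
# Toy sketch: straight-through one-hot quantization over the full codebook (assumption-based).
import torch
import torch.nn.functional as F

def ibq_style_quantize(z, codebook):
    """z: (B, D) encoder features; codebook: (K, D) embedding table."""
    logits = z @ codebook.t()                       # (B, K) similarity between features and codes
    indices = logits.argmax(dim=-1)
    hard = F.one_hot(indices, codebook.shape[0]).float()
    soft = logits.softmax(dim=-1)
    one_hot_st = hard + soft - soft.detach()        # straight-through: hard forward, soft backward
    z_q = one_hot_st @ codebook                     # every codebook row stays differentiable
    return z_q, indices

codebook = torch.nn.Parameter(torch.randn(512, 64))
z = torch.randn(8, 64, requires_grad=True)
z_q, idx = ibq_style_quantize(z, codebook)
z_q.sum().backward()
print(codebook.grad.abs().sum() > 0)                # gradients reach the codebook embeddings
```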
Poster
JiaKui Hu · Zhengjian Yao · Lujia Jin · Hangzhou He · Yanye Lu

[ Exhibit Hall I ]

Abstract
Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
Poster
Mengyu Wang · Henghui Ding · Jianing Peng · Yao Zhao · Yunpeng Chen · Yunchao Wei

[ Exhibit Hall I ]

Abstract
In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT models. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios.
Poster
Jiahao Wang · Ning Kang · Lewei Yao · Mengzhao Chen · Chengyue Wu · Songyang Zhang · Shuchen Xue · Yong Liu · Taiqiang Wu · Xihui Liu · Kaipeng Zhang · Shifeng Zhang · Wenqi Shao · Zhenguo Li · Ping Luo

[ Exhibit Hall I ]

Abstract
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, given its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include 5 practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged, pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to help the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient alternative baseline for DiT with pure linear attention. In class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize …
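To make guideline 1 concrete, the following is a hedged toy sketch of "simple linear attention plus a depth-wise convolution branch", written as a hypothetical module rather than the LiT release: the softmax is replaced with a positive kernel feature map so the cost is linear in the number of tokens, and a depth-wise convolution adds local spatial mixing.

```python
# Toy sketch: linear attention with a depth-wise convolution branch (assumption-based).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttnDWConv(nn.Module):
    def __init__(self, dim, heads=2):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv branch
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        B, N, C = x.shape
        H = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.elu(q) + 1                                             # positive kernel feature map
        k = F.elu(k) + 1
        q = q.view(B, N, H, C // H).transpose(1, 2)                  # (B, H, N, d)
        k = k.view(B, N, H, C // H).transpose(1, 2)
        v = v.view(B, N, H, C // H).transpose(1, 2)
        kv = k.transpose(-2, -1) @ v                                 # (B, H, d, d): linear in N
        z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        out = (q @ kv) * z                                           # normalized linear attention
        out = out.transpose(1, 2).reshape(B, N, C)
        h, w = hw
        local = self.dwconv(x.transpose(1, 2).reshape(B, C, h, w)).flatten(2).transpose(1, 2)
        return self.proj(out + local)                                # global linear attn + local conv

m = LinearAttnDWConv(dim=32, heads=2)
tokens = torch.randn(1, 16 * 16, 32)
print(m(tokens, hw=(16, 16)).shape)   # torch.Size([1, 256, 32])
```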
Poster
Jiayi Guo · Chuanhao Yan · Xingqian Xu · Yulin Wang · Kai Wang · Gao Huang · Humphrey Shi

[ Exhibit Hall I ]

Abstract
Ensuring precise alignment between diffusion-generated images and input prompts is a long-term challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be open-sourced.
Poster
Paschalis Giakoumoglou · Dimitrios Karageorgiou · Symeon Papadopoulos · Panagiotis Petrantonakis

[ Exhibit Hall I ]

Abstract
Recent advancements in generative AI have made text-guided image inpainting—adding, removing, or altering image regions using textual prompts—widely accessible. However, generating semantically correct photorealistic imagery typically requires carefully-crafted prompts and iterative refinement by evaluating the realism of the generated content - tasks commonly performed by humans. To automate the generative process, we propose Semantically Aligned and Uncertainty Guided AI Image Inpainting (SAGI), a model-agnostic pipeline, to sample prompts from a distribution that closely aligns with human perception and to evaluate the generated content, discarding samples that deviate from such a distribution, which we approximate using pretrained Large Language Models and Vision-Language Models. By applying this pipeline on multiple state-of-the-art inpainting models, we create the SAGI Dataset (SAGI-D), currently the largest and most diverse dataset of AI-generated inpaintings, comprising over 95k inpainted images and a human-evaluated subset. Our experiments show that semantic alignment significantly improves image quality and aesthetics, while uncertainty guidance effectively identifies realistic manipulations — human ability to identify inpainted images from real ones drops from 74\% to 35\% in terms of accuracy, after applying our pipeline. Moreover, using SAGI-D for training several image forensic approaches increases in-domain detection performance on average by 37.4\% and out-of-domain generalization by …
Poster
Ao Ma · Jiasong Feng · Ke Cao · Jing Wang · Yun Wang · Quanwei Zhang · Zhanjie Zhang

[ Exhibit Hall I ]

Abstract
Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject’s position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Toggable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion …
Poster
Zhenyu Yan · Jian Wang · Aoqiang Wang · Yuhan Li · Wenxiang Shang · Zhu Hangcheng

[ Exhibit Hall I ]

Abstract
In image editing tasks, high-quality text editing capabilities can significantly reduce both human and material resource costs. Existing methods, however, face significant limitations in terms of stroke accuracy for complex text and controllability of generated text styles. To address these challenges, we propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions, while ensuring proper layout and controllable text style. Our approach incorporates adaptive standard letter spacing as guidance during training and employs adaptive mask boosting to prevent the leakage of text position and size information. By leveraging an attention mechanism to compute the intermediate layer bounding box regression loss for each character, our method enables the learning of text layout across diverse contexts. Additionally, we enhance text rendering accuracy and fidelity by injecting high-resolution standard font information and applying perceptual loss within the text editing region. Through a novel style injection technique, we achieve controllable style transfer for the injected text. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method.
Poster
Alakh Desai · Nuno Vasconcelos

[ Exhibit Hall I ]

Abstract
Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (*CFG*) to develop a sampling procedure, *Adaptive Negative Sampling Without External Resources* (*ANSWER*), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. *ANSWER* is a training-free technique, applicable to any model that supports *CFG*, and allows for negative grounding of image concepts without explicit negative prompts, which are lossy and incomplete. Experiments show that adding *ANSWER* to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more often than the other methods.
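For background on the CFG-with-negative-prompting formulation that this line of work starts from, here is a minimal sketch of the standard guidance rule with an explicit negative condition. This is not an implementation of ANSWER (which removes the need for the explicit negative prompt); `model` is a hypothetical noise predictor and the toy stand-in is only for shape checking.

```python
# Background sketch: classifier-free guidance with an explicit negative prompt (standard formulation).
import torch

def cfg_with_negative(model, x_t, t, pos_emb, neg_emb, scale=7.5):
    eps_pos = model(x_t, t, pos_emb)   # conditioned on the positive prompt
    eps_neg = model(x_t, t, neg_emb)   # conditioned on the negative (or empty) prompt
    # Guidance pushes away from the negative condition and toward the positive one.
    return eps_neg + scale * (eps_pos - eps_neg)

model = lambda x, t, c: x * 0.1 + c            # toy stand-in noise predictor
x = torch.randn(1, 4, 8, 8)
pos, neg = torch.ones_like(x), torch.zeros_like(x)
print(cfg_with_negative(model, x, 0, pos, neg).shape)   # torch.Size([1, 4, 8, 8])
```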
Poster
Donald Shenaj · Ondrej Bohdal · Mete Ozay · Pietro Zanuttigh · Umberto Michieli

[ Exhibit Hall I ]

Abstract
Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles. Prior works achieved personalization by merging corresponding low-rank adapters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices like smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a remarkable speedup of over $4000\times$ in the merging process. We collect a dataset of style and subject LoRAs and pre-train a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs, enabling fast, high-quality personalization. Moreover, we identify limitations in existing evaluation metrics for content-style quality and propose a new protocol using multimodal large language models (MLLMs) for more accurate assessment. Our method significantly outperforms the current state of the art in both content and style fidelity, as validated by MLLM assessments and human evaluations.
Poster
Andy Regensky · Marc Windsheimer · Fabian Brand · Andre Kaup

[ Exhibit Hall I ]

Abstract
Neural video codecs (NVCs) have seen fast-paced advancement in recent years and already perform close to state-of-the-art traditional video codecs like H.266/VVC. However, NVC investigations have so far focused on improving performance for classical perspective video, leaving the increasingly important 360-degree video format unexplored. In this paper, we address this issue and present how existing NVCs can be optimized for 360-degree video while also improving performance on perspective video. As no suitable datasets for neural 360-degree video compression exist, we publish a large-scale 360-degree video dataset consisting of more than 6000 user-generated 9-frame sequences with resolutions ranging from 0.5K to 8K. We propose a novel method for training data augmentation exploiting the spherical characteristics of 360-degree video, which proves crucial for achieving maximum compression performance. An additional positional feature encoding further supports the NVC in dynamic bitrate allocation, notably improving the performance for both 360-degree and perspective video. Overall, we achieve rate savings of almost 8% for 360-degree video and more than 3% for perspective video with minimal complexity overhead. The dataset is available at: {link will be provided upon acceptance}. Source code and pre-trained model weights are available at: {link will be provided upon acceptance}.
Poster
Chuanwei Huang · Zexi Jia · Hongyan Fei · Yeshuang Zhu · Zhiqiang Yuan · Ying Deng · Jiapei Zhang · Xiaoyue Duan · Jinchao Zhang · Jie Zhou

[ Exhibit Hall I ]

Abstract
With the rapid advancement of generative models, we can now create highly realistic images. This represents a significant technical breakthrough but also introduces new challenges for copyright protection. Previous methods for detecting copyright infringement in AI-generated images mainly depend on global similarity. However, real-world infringement often occurs only on certain attributes rather than being a global infringement. To address these challenges, we propose a novel Multi-aspect Copyright Infringement Detection (MCID) task, which encompasses various types of infringement, including content, style, structure, and intellectual property infringement. We further develop the Hybrid Infringement Detection Model (HIDM) to address the MCID task. By combining feature-based methods with VLMs, it enables the detection of various infringement types and provides interpretable results. To ensure the MCID task meets actual legal requirements, we construct a Large-Scale Copyright Dataset (LSCD) with clear author copyright ownership. Based on LSCD, we provide a benchmark annotated by legal experts for performance evaluation. Experimental results show that HIDM effectively detects various types of image copyright infringement and offers a more interpretable and superior solution compared to previous methods.
Poster
Yuanhao Zhai · Yen-Liang Lin · Minxu Peng · Larry Davis · Ashwin Chandramouli · Junsong Yuan · David Doermann

[ Exhibit Hall I ]

Abstract
Existing outfit recommendation frameworks mainly focus on outfit compatibility prediction and complementary item retrieval. However, the outfit items are predicted by the pre-trained model and cannot be controlled by the text prompt. We present a text-driven outfit generation framework, Text2Outfit, which generates outfits controlled by the text prompt. Our framework supports two forms of outfit recommendation: 1) text-to-outfit generation, which retrieves the outfits given the prompt, where the prompt includes the specification of the entire outfit (e.g., occasion or season) and the individual outfit items (e.g., product feature), and 2) seed-to-outfit generation, which additionally uses a seed item (image or item descriptions) as input and retrieves items to build outfits. We train a large language model (LLM) framework to predict a set of embeddings to retrieve outfit items. We devise an attention masking mechanism in the LLM to handle the alignment between the outfit text descriptions in the prompt and the image tokens from different categories. We conducted experiments on the Polyvore dataset and evaluated outfit retrieval performance from two perspectives: 1) feature matching for outfit items and 2) outfit compatibility. The results show that our approach achieves significantly better performance than the baseline approaches for text to …
Poster
Hailing Wang · Jianglin Lu · Yitian Zhang · Yun Fu

[ Exhibit Hall I ]

Abstract
Quantization techniques, including quantization-aware training (QAT) and post-training quantization (PTQ), have become essential for inference acceleration of image super-resolution (SR) networks. Compared to QAT, PTQ has garnered significant attention as it eliminates the need for ground truth and model retraining. However, existing PTQ methods for SR often fail to achieve satisfactory performance as they overlook the impact of outliers in activation. Our empirical analysis reveals that these prevalent activation outliers are strongly correlated with image color information, and directly removing them leads to significant performance degradation. Motivated by this, we propose a dual-region quantization strategy that partitions activations into an outlier region and a dense region, applying uniform quantization to each region independently to better balance bit-width allocation. Furthermore, we observe that different network layers exhibit varying sensitivities to quantization, leading to different levels of performance degradation. To address this, we introduce sensitivity-aware finetuning that encourages the model to focus more on highly sensitive layers, further enhancing quantization performance. Extensive experiments demonstrate that our method outperforms existing PTQ approaches across various SR networks and datasets, while achieving performance comparable to QAT methods in most scenarios with at least a 75 $\times$ speedup.
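The dual-region idea can be illustrated with the following hedged toy sketch (my own reconstruction from the abstract, not the authors' quantizer): activations above a percentile threshold form an "outlier" region with its own uniform quantization grid, while the dense region is quantized on a separate grid, so a few extreme values no longer stretch the step size for everything else.

```python
# Toy sketch: dual-region uniform quantization of activations (assumption-based).
import torch

def uniform_quant(x, bits=8):
    if x.numel() == 0:
        return x
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
    return torch.round((x - lo) / scale) * scale + lo

def dual_region_quant(act, bits=8, pct=0.99):
    thresh = torch.quantile(act.abs().flatten(), pct)      # percentile split between regions
    outlier_mask = act.abs() > thresh
    q = act.clone()
    q[~outlier_mask] = uniform_quant(act[~outlier_mask], bits)   # dense-region grid
    q[outlier_mask] = uniform_quant(act[outlier_mask], bits)     # separate grid for outliers
    return q

act = torch.cat([torch.randn(10000), torch.randn(10) * 50])      # heavy-tailed toy activations
q = dual_region_quant(act)
print((q - act).abs().mean())                                    # small error despite outliers
```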
Poster
Junsong Chen · Shuchen Xue · Yuyang Zhao · Jincheng YU · Sayak Paul · Junyu Chen · Han Cai · Enze Xie · Song Han

[ Exhibit Hall I ]

Abstract
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: $\textbf{(1)}$ We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. $\textbf{(2)}$ SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. $\textbf{(3)}$ We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in just 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024$\times$1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and …
Poster
Kasra Arabi · R. Teal Witter · Chinmay Hegde · Niv Cohen

[ Exhibit Hall I ]

Abstract
Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving …
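As a hedged illustration of deriving a key pattern from a semantic embedding via locality-sensitive hashing (the general idea named in the abstract, not the paper's pipeline), the sketch below hashes an embedding with fixed random hyperplanes; nearby embeddings produce nearly identical bit patterns, so the key can be re-inferred from the image content at verification time. All names and the embedding source are assumptions.

```python
# Toy sketch: random-hyperplane LSH turning a semantic embedding into a key bit pattern (assumption-based).
import numpy as np

def lsh_key_pattern(embedding, n_bits=256, seed=0):
    rng = np.random.default_rng(seed)                      # fixed hyperplanes shared by embedder and verifier
    planes = rng.standard_normal((n_bits, embedding.shape[-1]))
    return (planes @ embedding > 0).astype(np.int8)        # sign pattern = key bits

emb = np.random.randn(512)                                 # stand-in semantic embedding of an image
emb_close = emb + 0.01 * np.random.randn(512)              # slightly perturbed embedding (e.g., after re-encoding)
k1, k2 = lsh_key_pattern(emb), lsh_key_pattern(emb_close)
print((k1 == k2).mean())                                   # nearby embeddings yield nearly identical keys
```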
Poster
Giuseppe Cartella · Vittorio Cuculo · Alessandro D'Amelio · Marcella Cornia · Giuseppe Boccignone · Rita Cucchiara

[ Exhibit Hall I ]

Abstract
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models will be made publicly available.
Poster
Lorenzo Baraldi · Davide Bucciarelli · Federico Betti · Marcella Cornia · Lorenzo Baraldi · Nicu Sebe · Rita Cucchiara

[ Exhibit Hall I ]

Abstract
Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation with human judgment. We will publicly release our source code, models, and data.
Poster
Haoxuan Li · Ziya Erkoç · Lei Li · Daniele Sirigatti · Vladislav Rosov · Angela Dai · Matthias Nießner

[ Exhibit Hall I ]

Abstract
We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-designed triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into ‘deletion’ of regions of a mesh, followed by ‘addition’ of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
Poster
Kwanyoung Kim · Byeongsu Sim

[ Exhibit Hall I ]

Abstract
Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.
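One plausible reading of "extrapolating between softmax attention and a sparse counterpart" is sketched below, with heavy caveats: this is not the PLADIS implementation, the sparse branch here is a simple top-k softmax stand-in (the paper may use a different sparse transform), and the extrapolation direction and scale are illustrative assumptions.

```python
# Toy sketch: inference-time extrapolation between dense and sparse cross-attention maps (assumption-based).
import torch

def topk_softmax(scores, k):
    mask = torch.full_like(scores, float("-inf"))
    topk = scores.topk(k, dim=-1)
    mask.scatter_(-1, topk.indices, topk.values)   # keep only the top-k scores per query
    return mask.softmax(dim=-1)

def extrapolated_cross_attention(q, k, v, lam=1.5, sparse_k=8):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    dense = scores.softmax(dim=-1)
    sparse = topk_softmax(scores, sparse_k)
    attn = sparse + lam * (sparse - dense)         # one plausible extrapolation toward sparse behaviour
    return attn @ v

q = torch.randn(1, 16, 64)                         # queries (e.g., image tokens)
k = v = torch.randn(1, 77, 64)                     # keys/values (e.g., text tokens)
print(extrapolated_cross_attention(q, k, v).shape)   # torch.Size([1, 16, 64])
```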
Poster
Zongyu Lin · Wei Liu · Chen Chen · Jiasen Lu · Wenze Hu · Tsu-Jui Fu · Jesse Allardice · Zhengfeng Lai · Liangchen Song · Bowen Zhang · cha chen · Yiran Fei · Lezhi Li · Yizhou Sun · Kai-Wei Chang · Yinfei Yang

[ Exhibit Hall I ]

Abstract
We present a simple and scalable text and image conditioned video generation method. Our approach, named STIV, integrates a variable number of image conditions into a Diffusion Transformer (DiT) through frame replacement. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, as well as long video generation through autoregressive rollouts. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, and multi-view generation. With comprehensive ablation studies on T2I, T2V, TI2V, and long video generation, STIV demonstrates strong performance, despite its simple design. An 8.7B model with \(512^2\) resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at \(512^2\) resolution. Combining all of these, we finally scale up our model to 540p with over 200 frames. By providing a transparent recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress for video generation.
Poster
Zonglin Lyu · Chen Chen

[ Exhibit Hall I ]

Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves \textbf{20}\% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having \textbf{3}$\times$ fewer parameters. Such a parameter reduction results in \textbf{2.3}$\times$ speed up. By incorporating optical flow guidance, our method requires \textbf{9000}$\times$ less training data and achieves over \textbf{20}$\times$ fewer parameters than video-based …
Poster
Yingjian Chen · Lei Zhang · Yakun Niu

[ Exhibit Hall I ]

Abstract
The rise of generative models has raised concerns about image authenticity online, highlighting the urgent need for a detector that is (1) highly generalizable, capable of handling unseen forgery techniques, and (2) data-efficient, achieving optimal performance with minimal training data, enabling it to counter newly emerging forgery techniques effectively. To achieve this, we propose $\textbf{\textit{ForgeLens}}$, a data-efficient, feature-guided framework that incorporates two lightweight designs to enable a frozen network to focus on forgery-specific features. First, we introduce the Weight-Shared Guidance Module (WSGM), which guides the extraction of forgery-specific features during training. Second, a forgery-aware feature integrator, FAFormer, is used to effectively integrate forgery information across multi-stage features. ForgeLens addresses a key limitation of previous frozen network-based methods, where general-purpose features extracted from large datasets often contain excessive forgery-irrelevant information. As a result, it achieves strong generalization and reaches optimal performance with minimal training data. Experimental results on 19 generative models, including both GANs and diffusion models, demonstrate improvements of 13.61\% in Avg.Acc and 8.69\% in Avg.AP over the base model. Notably, ForgeLens outperforms existing forgery detection methods, achieving state-of-the-art performance with just 1\% of the training data.
Poster
Daniel Winter · Asaf Shul · Matan Cohen · Dana Berman · Yael Pritch · Alex Rav-Acha · Yedid Hoshen

[ Exhibit Hall I ]

Abstract
This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Differently from many other multi-reference methods, ObjectMate does not require slow test-time tuning.
Poster
Yanran Zhang · Bingyao Yu · Yu Zheng · Wenzhao Zheng · Yueqi Duan · Lei Chen · Jie Zhou · Jiwen Lu

[ Exhibit Hall I ]

Abstract
The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN- or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error ($\bf{D^3QE}$) for autoregressive-generated image detection, exploiting the distinctive patterns and codebook frequency-distribution bias that exist between real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features with quantization-error latents. To evaluate our method, we construct a comprehensive dataset covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of $\bf{D^3QE}$ across different AR models, while maintaining robustness under various real-world perturbations.
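The quantization-error signal that $\bf{D^3QE}$ builds on can be illustrated with a short sketch; the codebook, feature shapes, and statistics below are hypothetical placeholders rather than the authors' pipeline.

```python
# Sketch: quantization error and codebook-usage statistics of continuous features
# against a VQ codebook, the kind of raw signal discrete-distribution-aware
# detectors can build on (shapes are illustrative).
import torch

def quantization_error(features, codebook):
    """features: (N, D) continuous latents; codebook: (K, D) VQ entries."""
    d = torch.cdist(features, codebook)            # (N, K) pairwise distances
    idx = d.argmin(dim=1)                          # nearest codebook entry per vector
    err = features - codebook[idx]                 # quantization-error latent
    hist = torch.bincount(idx, minlength=codebook.size(0)).float()
    return err, hist / hist.sum()                  # error + codebook usage frequency

feats = torch.randn(1024, 256)
codebook = torch.randn(8192, 256)
err, freq = quantization_error(feats, codebook)
print(err.shape, freq.shape)                       # (1024, 256) and (8192,)
```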
Poster
Huanpeng Chu · Wei Wu · Guanyu Feng · Yutao Zhang

[ Exhibit Hall I ]

Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers—stemming from a large number of sampling steps and complex per-step computations—presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
Poster
Mahir Atmis · LEVENT KARACAN · Mehmet SARIGÜL

[ Exhibit Hall I ]

Abstract
Specular highlights, though valuable for human perception, are often undesirable in computer vision and graphics tasks as they can obscure surface details and affect analysis. Existing methods rely on multi-stage pipelines or multi-label datasets, making training difficult. In this study, we propose a one-step diffusion-based model for specular highlight removal, leveraging a pre-trained diffusion-based image generation model with an adaptation mechanism to enhance efficiency and adaptability. To further improve the adaptation process, we introduce ProbLoRA, a novel modification of Low-Rank Adaptation (LoRA), designed to adapt the diffusion model for highlight removal effectively. Our approach surpasses existing methods, achieving state-of-the-art performance in both quantitative metrics and visual quality. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of our method, highlighting its robustness and generalization capabilities.
Poster
Gopika Sudhakaran · Hikaru Shindo · Patrick Schramowski · Simone Schaub-Meyer · Kristian Kersting · Stefan Roth

[ Exhibit Hall I ]

Abstract
Visual relation detection (VRD) is the challenging task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it relies on handcrafted prompts and struggles with novel or complex relationships. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction-tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the detected relationships for segmenting complex scenes.
Poster
Zerui Tao · Yuhta Takida · Naoki Murata · Qibin Zhao · Yuki Mitsufuji

[ Exhibit Hall I ]

Abstract
Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and the desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank, dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable …
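A minimal sketch of the transform-plus-residual scheme follows, with a plain LoRA-style low-rank residual standing in for the paper's tensor decompositions; the module names, initialization, and the side on which the transform multiplies the weight are illustrative assumptions.

```python
# Sketch: adapted weight = T @ W0 + low-rank residual, with W0 frozen.
# The residual here is LoRA-style for simplicity; names are illustrative.
import torch
import torch.nn as nn

class TransformPlusResidualLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.weight0 = nn.Parameter(base.weight.detach(), requires_grad=False)  # frozen W0
        self.bias = base.bias
        self.transform = nn.Parameter(torch.eye(out_f))        # full-rank transform T, init = identity
        self.lora_a = nn.Parameter(torch.zeros(rank, in_f))    # residual starts at zero
        self.lora_b = nn.Parameter(torch.randn(out_f, rank) * 0.01)

    def forward(self, x):
        w = self.transform @ self.weight0 + self.lora_b @ self.lora_a  # T·W0 + residual
        return nn.functional.linear(x, w, self.bias)

layer = TransformPlusResidualLinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```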
Poster
Jingyi Pan · Dan Xu · Qiong Luo

[ Exhibit Hall I ]

Abstract
Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. Our model and code will be made publicly available upon acceptance.
Poster
Saemi Moon · Minjong Lee · Sangdon Park · Dongwoo Kim

[ Exhibit Hall I ]

Abstract
As text-to-image diffusion models gain widespread commercial applications, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, previous evaluations primarily focus on whether target concepts are removed while preserving image quality, neglecting broader impacts such as unintended side effects. In this work, we propose the Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, with 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.
Poster
Qing Lin · Jingfeng Zhang · YEW-SOON ONG · Mengmi Zhang

[ Exhibit Hall I ]

Abstract
Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. First, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce a new evaluation metric to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and dataset will be made public.
Poster
Zehuan Huang · Yuan-Chen Guo · Haoran Wang · Ran Yi · Lizhuang Ma · Yanpei Cao · Lu Sheng

[ Exhibit Hall I ]

Abstract
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to high computational costs and degradation in image quality due to scarce high-quality 3D data. This paper introduces MV-Adapter, an efficient and versatile adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. To efficiently model 3D geometric knowledge within the adapter, we introduce innovative designs, including duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models while learning novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL) and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation and opens up new possibilities owing to its efficiency, adaptability, and versatility.
Poster
Alessandro Conti · Massimiliano Mancini · Enrico Fini · Yiming Wang · Paolo Rota · Elisa Ricci

[ Exhibit Hall I ]

Abstract
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them. Our evaluation suite will be made openly available, serving as a resource for future research.
Poster
Hanling Zhang · Rundong Su · Zhihang Yuan · Pengtao Chen · Mingzhu Shen · Yibo Fan · Shengen Yan · Guohao Dai · Yu Wang

[ Exhibit Hall I ]

Abstract
Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT’s attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68\% reduction in attention FLOPs on 2K image generation without compromising visual fidelity.
Poster
Rui Xie · Rui Xie · Yuzhang Shang · Hanling Zhang · Siyuan Wang · Shengen Yan · Guohao Dai · Yu Wang

[ Exhibit Hall I ]

Abstract
Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose DLFR-Gen, a training-free approach for Dynamic Latent Frame Rate Generation in Diffusion Transformers. DLFR-Gen adaptively adjusts the number of elements in the latent space based on the motion frequency of the latent-space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: a dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates to video segments; a novel latent-space frame merging method that aligns latent representations with their denoised counterparts before merging those that are redundant in low-resolution space; and a preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that DLFR-Gen can achieve a speedup of up to 3 times for video generation with minimal quality degradation.
Poster
Ibtihel Amara · Ahmed Imtiaz Humayun · Ivana Kajic · Zarana Parekh · Natalie Harris · Sarah Young · Chirag Nagpal · Najoung Kim · Junfeng He · Cristina Vasconcelos · Deepak Ramachandran · Golnoosh Farnadi · Katherine Heller · Mohammad Havaei · Negar Rostamzadeh

[ Exhibit Hall I ]

Abstract
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To enable a more comprehensive evaluation of concept erasure, we introduce EraseBench, a multidimensional framework designed to rigorously assess text-to-image models post-erasure. It encompasses over 100 diverse concepts, carefully curated seeded prompts to ensure reproducible image generation, and dedicated evaluation prompts for model-based assessment. Paired with a robust suite of evaluation metrics, our framework provides a holistic and in-depth analysis of concept erasure’s effectiveness and its long-term impact on model behaviour. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.
Poster
Revant Teotia · Candace Ross · Karen Ullrich · Sumit Chopra · Adriana Romero-Soriano · Melissa Hall · Matthew Muckley

[ Exhibit Hall I ]

Abstract
Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity (“Does” the model generate images with expected attributes?) and generalization capacity (“Can” the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding …
Poster
Anthony Bisulco · Rahul Ramesh · Randall Balestriero · Pratik Chaudhari

[ Exhibit Hall I ]

Abstract
Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning across factors such as masking ratio, patch size, number of encoder and decoder layers, as researchers use these methods for different applications. While prior theoretical work has analyzed MAEs through the lens of attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and the performance on downstream tasks is relatively unexplored. In this work, we investigate the perspective that "MAEs learn spatial correlations in the input image". We analytically derive the features learnt by a linear MAE and show that masking ratio and patch size can be used to select between features capturing short- and long-range spatial correlations. Extending this analysis to nonlinear MAEs, we show that learned representations in MAEs adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.
Poster
Sunung Mun · Jinhwan Nam · Sunghyun Cho · Jungseul Ok

[ Exhibit Hall I ]

Abstract
Text-guided image editing with diffusion models enables flexible modifications, but editing multiple objects remains challenging due to unintended attribute interference, where edits affect non-target regions or mix attributes within the target areas. We identify the End-of-Sequence (EOS) token embeddings as a key factor in this issue, introducing global semantics that disrupt intended modifications. To address this, we propose Attribute-LEakage-free Editing (ALE-Edit), an approach that is both effective, by properly addressing EOS-induced interference, and efficient, as it requires no additional fine-tuning. ALE-Edit consists of: (1) Object-Restricted Embedding (ORE) to localize attributes, (2) Region-Guided Blending for Cross-Attention Masking (RGB-CAM) to align attention with target regions, and (3) Background Blending (BB) to preserve structural consistency. Additionally, we introduce ALE-Bench, a benchmark to quantify target-external and target-internal interference. Experiments show that ALE-Edit reduces unintended changes while maintaining high-quality edits, outperforming existing tuning-free methods. Our approach provides a scalable and computationally efficient solution for multi-object image editing.
Poster
Xiaolong Jin · Zixuan Weng · Hanxi Guo · Chenlong Yin · Siyuan Cheng · Guangyu Shen · Xiangyu Zhang

[ Exhibit Hall I ]

Abstract
Diffusion models are widely used in real-world applications, but ensuring their safety remains a major challenge. Despite many efforts to enhance the security of diffusion models, jailbreak and adversarial attacks can still bypass these defenses and generate harmful content. However, the lack of standardized evaluation makes it difficult to assess the robustness of diffusion model systems. To address this, we introduce JailbreakDiffBench, a comprehensive benchmark for systematically evaluating the safety of diffusion models against various attacks and under different defenses. Our benchmark includes a high-quality, human-annotated prompt and image dataset covering diverse attack scenarios. It consists of two key components: (1) an evaluation protocol to measure the effectiveness of moderation mechanisms and (2) an attack assessment module to benchmark adversarial jailbreak strategies. Through extensive experiments, we analyze existing filters and reveal critical weaknesses in current safety measures. JailbreakDiffBench is designed to support both text-to-image and text-to-video models, ensuring extensibility and reproducibility. The code is available at https://anonymous.4open.science/r/jailbreakdiffbench/
Poster
Tongkai Shi · Lianyu Hu · Fanhua Shang · Liqing Gao · Wei Feng

[ Exhibit Hall I ]

Abstract
Sign Language Video Generation (SLVG) aims to transform sign language sequences into natural and fluent sign language videos. Existing SLVG methods lack geometric modeling of human anatomical structures, leading to anatomically implausible and temporally inconsistent generation. To address these challenges, we propose a novel SLVG framework: Geometry-Aware Region Refinement (GReg). GReg uses 3D geometric information (such as normal maps and gradient maps) from the SMPL-X model to ensure anatomical and temporal consistency. To fully leverage the 3D geometric priors, we propose two novel methods: 1) Regional Prior Generation, which uses regional expert networks to generate target-structured regions as generation priors; 2) Gradient-Enhanced Refinement, which guides the refinement of detailed structures in key regions using gradient features. Furthermore, we enhance visual realism in key regions through adversarial training on both these regions and their gradient maps. Experimental results demonstrate that GReg achieves state-of-the-art performance with superior structural accuracy and temporal consistency.
Poster
Yaqing Ding · Viktor Kocur · VACLAV VAVRA · Zuzana Berger Haladova · jian Yang · Torsten Sattler · Zuzana Kukelova

[ Exhibit Hall I ]

Abstract
Recent advances in monocular depth estimation (MDE) methods and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question of whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) calibrated cameras, (2) cameras with an unknown shared focal length, and (3) cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code will be made publicly available.
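As a toy illustration of the scale/shift ambiguity mentioned above, the snippet below recovers an unknown scale and shift of a monocular depth prediction by least squares against sparse metric depths; the paper's minimal solvers instead estimate these jointly with the relative pose, so this is only a conceptual sketch.

```python
# Sketch: recover the unknown scale s and shift t of a monocular depth prediction
# by least squares against metric depths (conceptual, not the paper's solvers).
import numpy as np

def fit_scale_shift(pred_depth, metric_depth):
    """Solve min_{s,t} || s * pred + t - metric ||^2 over sparse correspondences."""
    A = np.stack([pred_depth, np.ones_like(pred_depth)], axis=1)   # (N, 2)
    (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return s, t

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=200)                     # "metric" depths
pred = (gt - 0.3) / 2.0 + rng.normal(0, 0.01, size=200)   # prediction up to scale/shift + noise
s, t = fit_scale_shift(pred, gt)
print(round(s, 2), round(t, 2))                           # approximately 2.0 and 0.3
```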
Poster
Chengtang Yao · Lidong Yu · Zhidan Liu · Jiaxi Zeng · Yuwei Wu · Yunde Jia

[ Exhibit Hall I ]

Abstract
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions such as occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Second, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the …
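One plausible reading of the binary local ordering map is sketched below: for every pixel, record whether each neighbour in a window lies in front of or behind it, which is invariant to the unknown scale and shift of the monocular depth. The window size and tensor layout are illustrative choices, not the paper's exact construction.

```python
# Sketch: binary local ordering map from a (relative) depth map; each channel says
# whether one neighbour inside the window is farther than the center pixel.
import torch
import torch.nn.functional as F

def binary_local_ordering(depth, window=5):
    """depth: (B, 1, H, W). Returns (B, window*window, H, W) with entries in {0, 1}."""
    pad = window // 2
    patches = F.unfold(depth, kernel_size=window, padding=pad)   # (B, window*window, H*W)
    center = depth.flatten(2)                                    # (B, 1, H*W)
    ordering = (patches > center).float()                        # neighbour farther than center?
    return ordering.view(depth.size(0), window * window, *depth.shape[-2:])

d = torch.rand(1, 1, 64, 80)
print(binary_local_ordering(d).shape)   # torch.Size([1, 25, 64, 80])
```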
Poster
Dale Decatur · Thibault Groueix · Wang Yifan · Rana Hanocka · Vladimir Kim · Matheus Gadelha

[ Exhibit Hall I ]

Abstract
Text-to-image diffusion models enable high-quality image generation but are computationally expensive, especially when producing large image collections. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across multiple correlated prompts. Our key insight leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free method that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging unCLIP's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation.
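A minimal sketch of the two ingredients, clustering prompts by embedding similarity and then sharing the first denoising steps within each cluster, is shown below; the greedy cosine-threshold clustering and the placeholder embeddings are assumptions, not the paper's exact procedure.

```python
# Sketch: group prompts by cosine similarity of their embeddings; per cluster one
# would then run the first shared denoising steps once and branch per prompt
# afterwards (denoiser omitted; embeddings are toy placeholders).
import numpy as np

def greedy_cosine_clusters(embs, thresh=0.85):
    """Assign each prompt to the first cluster whose centroid is similar enough."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroids, assign = [], []
    for e in embs:
        sims = [float(e @ c) for c in centroids]
        if sims and max(sims) >= thresh:
            assign.append(int(np.argmax(sims)))
        else:
            centroids.append(e)
            assign.append(len(centroids) - 1)
    return assign

embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])   # two similar prompts + one distinct
print(greedy_cosine_clusters(embs))                       # e.g. [0, 0, 1]
```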
Poster
Tiange Xiang · Kai Li · Chengjiang Long · Christian Häne · Peihong Guo · Scott Delp · Ehsan Adeli · Li Fei-Fei

[ Exhibit Hall I ]

Abstract
Text-to-image diffusion models have seen significant development recently due to the increasing availability of paired 2D data. Although a similar trend is emerging in 3D generation, the limited availability of high-quality 3D data has resulted in less competitive 3D diffusion models compared to their 2D counterparts. In this work, we show how 2D diffusion models, originally trained for text-to-image generation, can be repurposed for 3D object generation. We introduce Gaussian Atlas, a representation of 3D Gaussians with dense 2D grids, which enables the fine-tuning of 2D diffusion models for generating 3D Gaussians. Our approach demonstrates successful transfer learning from a pretrained 2D diffusion model to a 2D manifold flattened from 3D structures. To facilitate model training, a large-scale dataset, Gaussian Atlas, is compiled, comprising 205K high-quality 3D Gaussian fittings of a diverse array of 3D objects. Our experimental results indicate that text-to-image diffusion models can also serve as 3D content generators.
Poster
Junyu Xie · Tengda Han · Max Bain · Arsha Nagrani · Eshika Khandelwal · Gül Varol · Weidi Xie · Andrew Zisserman

[ Exhibit Hall I ]

Abstract
Our objective is automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighboring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure -- an action score -- specifically targeted to assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.
Poster
Haowei Kuang · Wenhan Yang · Zongming Guo · Jiaying Liu

[ Exhibit Hall I ]

Abstract
Learned image compression aims to reduce redundancy by accurately modeling the complex signal distribution inherent in images with network parameters. However, existing practices that train models offline on entire datasets face a limitation: the estimated distribution only approximates the general image signal distribution and fails to capture image-specific characteristics. To address this issue, we propose a cross-granularity online optimization strategy to mitigate information loss from two key aspects: statistical distribution gaps and local structural gaps. This strategy introduces additional fitted bitstream to push the estimated signal distribution closer to the real one at both coarse-grained and fine-grained levels. For coarse-grained optimization, we relax the common bitrate constraints during gradient descent and reduce bitrate cost via adaptive QP (Quantization Parameter) selection, preventing information collapse and narrowing the statistical distribution gaps. For fine-grained optimization, a Mask-based Selective Compensation Module is designed to sparsely encode structural characteristics at low bitrates, enhancing local distribution alignment. By jointly optimizing global and local distributions, our method achieves closer alignment to real image statistics and significantly enhances performance. Extensive experiments validate the superiority of our method as well as the design of our module. Our project will be publicly available.
Poster
Nupur Kumari · Xi Yin · Jun-Yan Zhu · Ishan Misra · Samaneh Azadi

[ Exhibit Hall I ]

Abstract
Customization of text-to-image models enables users to insert custom concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that conditions on reference images via a shared attention mechanism to better incorporate their fine-grained visual details. Finally, we propose a new inference technique that normalizes text and image guidance vectors to mitigate overexposure issues during inference. Through extensive experiments, we show that our encoder-based model, trained on the synthetic dataset with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
Poster
Inkyu Shin · Chenglin Yang · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer’s output under-utilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce **DeepFlow**, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via internal velocity alignment, DeepFlow converges **8x faster** on ImageNet-256x256 with equivalent performance and further reduces FID by **2.6** while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MS-COCO and zero-shot GenEval. The code will be made publicly available.
Poster
Rui Yang · Huining Li · Yiyi Long · Xiaojun Wu · Shengfeng He

[ Exhibit Hall I ]

Abstract
Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence.
Poster
Jinhong Ni · Chang-Bin Zhang · Qiang Zhang · Jing Zhang

[ Exhibit Hall I ]

Abstract
The recent prosperity of text-to-image diffusion models, e.g., Stable Diffusion, has stimulated research on adapting them to 360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the underlying mechanisms enabling this empirical success. We hypothesize, and examine, that the trainable components exhibit distinct behaviors when fine-tuned on panoramic data, and that such adaptation conceals an intrinsic mechanism for leveraging the prior knowledge within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, and are thus less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We empirically verify these insights by introducing a simple framework called UniPano, with the objective of establishing an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared to prior dual-branch approaches, …
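The fine-tuning split suggested by this analysis can be sketched as follows: freeze query/key projections and train only value/output projections. The parameter-name matching ("to_q", "to_k", "to_v", "to_out") follows common diffusers-style attention naming and is an assumption about the backbone, not UniPano's actual code.

```python
# Sketch: train only value/output projections of attention modules, freeze the rest.
# Name matching is an assumption about how the backbone labels its projections.
import torch.nn as nn

def mark_trainable_value_output(model: nn.Module):
    for name, param in model.named_parameters():
        if "to_v." in name or "to_out." in name:
            param.requires_grad = True      # adapt V / output weights to the panoramic domain
        else:
            param.requires_grad = False     # Q/K and the rest of the backbone stay frozen
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")

class ToyAttn(nn.Module):
    """Stand-in for an attention block with diffusers-style projection names."""
    def __init__(self, d=8):
        super().__init__()
        self.to_q, self.to_k = nn.Linear(d, d), nn.Linear(d, d)
        self.to_v, self.to_out = nn.Linear(d, d), nn.Linear(d, d)

mark_trainable_value_output(nn.Sequential(ToyAttn(), ToyAttn()))
```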
Poster
Wenchuan Wang · Mengqi Huang · Yijing Tu · Zhendong Mao

[ Exhibit Hall I ]

Abstract
Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention, with a focus on identity and motion consistency. Existing works typically follow an isolated customization paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic $\textbf{mutual constraints and synergistic interdependencies}$ between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade quality. To address this, we introduce $\textbf{DualReal}$, a novel framework that employs adaptive joint training to collaboratively construct the interdependencies between the two dimensions. Specifically, DualReal is composed of two units: (1) $\textbf{Dual-aware Adaptation}$ dynamically selects a training phase ($\textit{i.e.}$, identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) the $\textbf{StageBlender Controller}$ leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive evaluation benchmark than existing methods. The experimental results show that DualReal improves the CLIP-I and DINO-I metrics by $\textbf{21.7}$% and $\textbf{31.8}$% on average, and achieves top performance on nearly all motion quality metrics.
Poster
Naresh Kumar Devulapally · Mingzhen Huang · Vishal Asnani · Shruti Agarwal · Siwei Lyu · Vishnu Lokhande

[ Exhibit Hall I ]

Abstract
Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of text-to-image Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $\mathcal{W}_*$, we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional whole-image watermarking. This method also leverages the text encoder’s compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99 \%$ bit accuracy ($48$ bits) with a $10^5 \times$ reduction in model parameters, enabling efficient watermarking.
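A minimal, hypothetical sketch of fine-tuning only a text token embedding $\mathcal{W}_*$ while the rest of the pipeline stays frozen is given below; the embedding table, reserved token, and loss are placeholders for the actual LDM text encoder and watermark-decoding objective.

```python
# Sketch: optimize a single learnable token embedding (textual-inversion style)
# while the embedding table and the rest of the model stay frozen.
# The loss below is a placeholder for a real decoder + bit-accuracy objective.
import torch
import torch.nn as nn

vocab, dim = 49408, 768
token_embedding = nn.Embedding(vocab, dim)            # stands in for the text encoder's table
token_embedding.requires_grad_(False)

watermark_token_id = vocab - 1                        # hypothetical reserved token
w_star = nn.Parameter(token_embedding.weight[watermark_token_id].clone())
optimizer = torch.optim.AdamW([w_star], lr=1e-3)      # only W_* is trainable

for step in range(3):                                 # placeholder training loop
    base = token_embedding(torch.tensor([0, 5]))      # frozen context tokens
    prompt_embs = torch.cat([base, w_star.unsqueeze(0)], dim=0)   # splice W_* in
    loss = prompt_embs.pow(2).mean()                  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(w_star.shape)
```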
Poster
Yulin Pan · Xiangteng He · Chaojie Mao · Zhen Han · Zeyinzi Jiang · Jingfeng Zhang · Yu Liu

[ Exhibit Hall I ]

Abstract
Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between …
Poster
Haiming Zhu · Yangyang Xu · Chenshu Xu · Tingrui Shen · Wenxi Liu · Yong Du · Jun Yu · Shengfeng He

[ Exhibit Hall I ]

Abstract
Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes the CFG equation to achieve cross-prompt alignment and introduces a constant-term null-text branch to stabilize the optimization process. This approach preserves the original content’s structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.
Poster
Tianrui Zhu · Shiyi Zhang · Jiawei Shao · Yansong Tang

[ Exhibit Hall I ]

Abstract
Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to $O(1)$ using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods.
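The core idea of reusing cached background keys/values can be sketched in a few lines; the shapes, the background mask, and the use of scaled dot-product attention are illustrative assumptions rather than KV-Edit's implementation.

```python
# Sketch: background-preserving attention with a KV cache. Keys/values of background
# tokens come from the cached source-image pass; only foreground tokens are recomputed.
import torch
import torch.nn.functional as F

def kv_preserving_attention(q_edit, k_edit, v_edit, k_cache, v_cache, bg_mask):
    """q_edit, k_edit, v_edit: (B, N, D) current editing pass; k_cache, v_cache:
    (B, N, D) cached from the source image; bg_mask: (N,) bool, True for background."""
    m = bg_mask.view(1, -1, 1)
    k = torch.where(m, k_cache, k_edit)      # background K/V frozen to the source image
    v = torch.where(m, v_cache, v_edit)
    return F.scaled_dot_product_attention(q_edit, k, v)

B, N, D = 1, 64, 32
tensors = [torch.randn(B, N, D) for _ in range(5)]
mask = torch.rand(N) > 0.5
print(kv_preserving_attention(*tensors, mask).shape)   # torch.Size([1, 64, 32])
```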
Poster
Junchao Huang · Xinting Hu · Shaoshuai Shi · Zhuotao Tian · Li Jiang

[ Exhibit Hall I ]

Abstract
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
Poster
Ju He · Qihang Yu · Qihao Liu · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm—directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3$\times$ at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds—all while delivering performance comparable to state-of-the-art models. Code will be available.
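A minimal sketch of one flow-matching training step between two modality latents in a shared 1D token space is shown below; the placeholder velocity network and random tensors stand in for FlowTok's actual text/image encoders and transformer.

```python
# Sketch: flow matching directly between modalities. x0 = projected text tokens,
# x1 = projected image tokens; the model regresses the constant velocity x1 - x0.
import torch
import torch.nn as nn

B, L, D = 4, 128, 16
velocity_net = nn.Sequential(nn.Linear(D + 1, 64), nn.GELU(), nn.Linear(64, D))

x0 = torch.randn(B, L, D)                       # text-side tokens (placeholder)
x1 = torch.randn(B, L, D)                       # image-side tokens (placeholder)
t = torch.rand(B, 1, 1)                         # interpolation time per sample
xt = (1 - t) * x0 + t * x1                      # linear interpolant between modalities
target_v = x1 - x0                              # flow-matching target velocity
pred_v = velocity_net(torch.cat([xt, t.expand(B, L, 1)], dim=-1))
loss = (pred_v - target_v).pow(2).mean()
loss.backward()
print(float(loss))
```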
Poster
Xin Wen · Bingchen Zhao · Ismail Elezi · Jiankang Deng · Xiaojuan Qi

[ Exhibit Hall I ]

Abstract
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space—a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference. Our code and models will be made publicly available.
Poster
Minghao Fu · Guo-Hua Wang · Xiaohao Chen · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang

[ Exhibit Hall I ]

Abstract
Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG's reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (**Te**xt **E**mbeddings **Fusion**), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model's complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher's output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher's performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6$\times$ faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher's complex sampling approach.
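The linear fusion of conditional and unconditional text embeddings can be sketched as follows; the exact fusion rule used by TeEFusion may differ, so this simply mirrors the CFG combination at the embedding level, and the student call is hypothetical.

```python
# Sketch: fold the guidance magnitude into the text embeddings so the distilled
# student needs a single forward pass per step (fusion rule is an assumption).
import torch

def fuse_text_embeddings(e_cond, e_uncond, guidance_scale):
    """e_cond, e_uncond: (B, L, D) conditional / unconditional text embeddings."""
    return e_uncond + guidance_scale * (e_cond - e_uncond)

e_c, e_u = torch.randn(2, 77, 768), torch.randn(2, 77, 768)
fused = fuse_text_embeddings(e_c, e_u, guidance_scale=7.5)
# student_eps = student_unet(x_t, t, encoder_hidden_states=fused)   # hypothetical single pass
print(fused.shape)   # torch.Size([2, 77, 768])
```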
Poster
Christian Simon · Masato Ishii · Akio Hayakawa · Zhi Zhong · Shusuke Takahashi · Takashi Shibuya · Yuki Mitsufuji

[ Exhibit Hall I ]

Abstract
The recent development of conditional diffusion models still requires heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative that avoids further fine-tuning of the base model. However, existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit their applicability to diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, termed TITAN-Guide, which overcomes these memory issues and provides more optimal control in the guidance process than its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation through a discriminative guiding model, and study forward gradient descent for guided diffusion tasks with various options for the directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks.
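The forward-gradient idea named above can be sketched as follows: a directional derivative computed with a forward-mode JVP along a random tangent gives an unbiased gradient estimate without reverse-mode backpropagation through the guiding model. The quadratic guidance loss and the tiny latent are placeholders, and this illustrates the generic technique rather than TITAN-Guide's exact update.

```python
# Sketch: forward-gradient descent on a latent. The directional derivative from a
# forward-mode JVP, multiplied by the random direction, is an unbiased gradient
# estimate (Baydin et al., 2022); no reverse-mode backprop is needed.
import torch
from torch.func import jvp

def guidance_loss(latent):
    return (latent - 1.0).pow(2).mean()       # placeholder for a discriminative guide

latent = torch.randn(4)                       # toy "latent"; real video latents are far larger
lr = 0.05
print("before:", float(guidance_loss(latent)))
for _ in range(200):
    tangent = torch.randn_like(latent)        # random direction v
    _, directional = jvp(guidance_loss, (latent,), (tangent,))
    latent = latent - lr * directional * tangent   # step along (directional derivative) * v
print("after: ", float(guidance_loss(latent)))     # loss is substantially smaller
```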
Poster
Minghan LI · Chenxi Xie · Yichen Wu · Lei Zhang · Mengyu Wang

[ Exhibit Hall I ]

Abstract
Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce **FiVE**, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks.Additionally, we adapt the latest rectified flow (RF) T2V generation models—Pyramid-Flow and Wan2.1—by introducing FlowEdit, resulting in training-free and inversion-free video editing models **Pyramid-Edit** and **Wan-Edit**. We compare six diffusion-based editing methods with our proposed two RF-based editing methods on our proposed FiVE benchmark, evaluating them across 14 metrics. These metrics include background preservation, text-video similarity, temporal consistency, and generated video quality. To further enhance object-level evaluation, we introduce **FiVE-Acc**, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More …
Poster
Zixin Zhu · Kevin Duarte · Mamshad Nayeem Rizve · Chengyuan Xu · Ratheesh Kalarot · Junsong Yuan

[ Exhibit Hall I ]

Abstract
In text-to-image (T2I) generation, achieving fine-grained control over attributes - such as age or smile - remains challenging, even with detailed text prompts. Slider-based methods offer a solution for precise control of image attributes. Existing approaches typically train an individual adapter for each attribute separately, overlooking the entanglement among multiple attributes. As a result, interference occurs among different attributes, preventing precise control of multiple attributes together. To address this challenge, we aim to disentangle multiple attributes in slider-based generation to enable more reliable and independent attribute manipulation. Our approach, CompSlider, can generate a conditional prior for the T2I foundation model to control multiple attributes simultaneously. Furthermore, we introduce novel disentanglement and structure losses to compose multiple attribute changes while maintaining structural consistency within the image. Since CompSlider operates in the latent space of the conditional prior and does not require retraining the foundation model, it reduces the computational burden for both training and inference. We evaluate our approach on a variety of image attributes and highlight its generality by extending it to video generation.
Poster
Yuhui WU · Liyi Chen · Ruibin Li · Shihao Wang · Chenxi Xie · Lei Zhang

[ Exhibit Hall I ]

Abstract
Instruction-based video editing allows effective and interactive editing of videos using only instructions, without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short-duration source videos in limited quantities, with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality \textbf{Ins}truction-based \textbf{Vi}deo \textbf{E}diting dataset with \textbf{1M} triplets, namely \textbf{InsViE-1M}. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M …
Poster
Zhaotong Yang · Yuhui Li · Shengfeng He · Xinzhe Li · Yangyang Xu · Junyu Dong · Yong Du

[ Exhibit Hall I ]

Abstract
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by a continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene.
Poster
Dewei Zhou · Mingwei Li · Zongxin Yang · Yi Yang

[ Exhibit Hall I ]

Abstract
Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7\% over FLUX and …
Poster
Yuzhuo Chen · Zehua Ma · Han Fang · Weiming Zhang · Nenghai Yu

[ Exhibit Hall I ]

Abstract
AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and high performance of generative image editing applications have elevated the risks of malicious tampering, creating new demands. 1) The tamper robustness of current lossless visual quality watermarks remains constrained by the modification-sensitive diffusion inversion process, necessitating enhanced robustness. 2) The improved tampering quality and rapid iteration cycles render passive tampering detection methods inadequate, making proactive tampering localization capability a desired feature for watermarks. To address these requirements, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises three key modules: a dual-mark joint sampling algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, a dense variation region detector leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and a tamper-aware message decoder guided by localization results. The experimental results indicate that TAG-WM achieves SOTA tampering robustness and tampering localization capability with distortions while …
Poster
jian ma · Qirong Peng · Xu Guo · Chen Chen · Haonan Lu · Zhenyu Yang

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, currently there is no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus and 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a decrease in performance degradation of less than 1\% while gaining various multimodal understanding abilities. Furthermore, it is applicable to LoRA training for image-text-to-image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, …
Poster
Tianyi Wei · Yifan Zhou · Dongdong Chen · Xingang Pan

[ Exhibit Hall I ]

Abstract
The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.
Poster
Yukuan Min · Muli Yang · Jinhao Zhang · Yuxuan Wang · Aming WU · Cheng Deng

[ Exhibit Hall I ]

Abstract
To promote the deployment of scenario understanding in the real world, Open-Vocabulary Scene Graph Generation (OV-SGG) has attracted much attention recently, aiming to generalize beyond the limited number of relation categories labeled during training and detect those unseen relations during inference. Towards OV-SGG, one feasible solution is to leverage the large-scale pre-trained vision-language models (VLMs) containing plentiful category-level content to capture accurate correspondences between images and text. However, due to the lack of quadratic relation-aware knowledge in VLMs, directly using the category-level correspondence in the base dataset could not sufficiently represent generalized relations involved in open world. Therefore, designing an effective open-vocabulary relation mining framework is challenging and meaningful. To this end, we propose a novel Vision-Language Interactive Relation Mining model (VL-IRM) for OV-SGG, which explores learning generalized relation-aware knowledge through multi-modal interaction. Specifically, first, to enhance the generalization of the relation text to visual content, we present a generative relation model to make the text modality explore possible open-ended relations based on visual content. Then, we employ visual modality to guide the relation text for spatial and semantic extension. Extensive experiments demonstrate the superior OV-SGG performance of our method.
Poster
Guanning Zeng · Xiang Zhang · Zirui Wang · Haiyang Xu · Zeyuan Chen · Bingnan Li · Zhuowen Tu

[ Exhibit Hall I ]

Abstract
We propose YOLO-Count, a new differentiable open-vocabulary object counting model that addresses both general counting challenges and enables training-free quantity control for text-to-image (T2I) generation. A key contribution is the `cardinality' map, a novel regression target designed to account for object size and location variations. By employing representation alignment and a hybrid supervision scheme, YOLO-Count minimizes the discrepancy between open-vocabulary counting and T2I generation control. The model's differentiable architecture facilitates gradient-based optimization for accurate object counts, leading to enhanced controllability and transparency in T2I systems. Our empirical evaluation demonstrates state-of-the-art counting accuracy and effective quantity control for the T2I generation tasks.
Poster
Chang Liu · Viraj Shah · Aiyu Cui · Svetlana Lazebnik

[ Exhibit Hall I ]

Abstract
This paper introduces UnZipLoRA, a method for decomposing an image into its constituent subject and style, represented as two distinct LoRAs (Low-Rank Adaptations). Unlike existing personalization techniques that focus on either subject or style in isolation, or require separate training sets for each, UnZipLoRA disentangles these elements from a single image by training both the LoRAs simultaneously. UnZipLoRA ensures that the resulting LoRAs are compatible, i.e., they can be seamlessly combined using direct addition. UnZipLoRA enables independent manipulation and recontextualization of subject and style, including generating variations of each, applying the extracted style to new subjects, and recombining them to reconstruct the original image or create novel variations. To address the challenge of subject and style entanglement, UnZipLoRA employs a novel prompt separation technique, as well as column and block separation strategies to accurately preserve the characteristics of subject and style, and ensure compatibility between the learned LoRAs. Evaluation with human studies and quantitative metrics demonstrates UnZipLoRA's effectiveness compared to other state-of-the-art methods, including DreamBooth-LoRA, Inspiration Tree, and B-LoRA.
Poster
U-Chae Jun · Jaeeun Ko · Jiwoo Kang

[ Exhibit Hall I ]

Abstract
We introduce a novel generative framework that unifies adversarial and diffusion-based training to overcome the limitations of conventional models. Our approach, termed Generative Adversarial Diffusion (GAD), integrates an adversarial loss directly into each denoising step of a latent diffusion model. By employing a single U-Net as a unified generator and discriminator, our framework eliminates the need for a separate discriminator, thereby reducing memory overhead and mitigating common GAN issues such as mode collapse and training instability. This integrated adversarial regularizer promotes semantic information exchange across timesteps, enabling the model to better capture complex data distributions even when training data is scarce or biased. Extensive experiments on standard latent diffusion benchmarks demonstrate that GAD significantly enhances image quality and mode coverage in tasks including text-to-image and image-to-3D generation. Our results suggest that unifying adversarial and diffusion-based training in a single network offers a promising new direction for high-fidelity, stable image synthesis.
Poster
Priyank Pathak · Yogesh Rawat

[ Exhibit Hall I ]

Abstract
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing changes. Existing approaches often rely on additional models or annotated attributes to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color—specifically foreground and background colors—as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce \textbf{S2A self-attention}, a novel mechanism designed to separate color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline's Top-1 accuracy by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 3.6% on MeVID for video-based ReID, without relying on additional supervision. Our results highlight the potential of color as …
Poster
Yiming Gong · Zhen Zhu · Minjia Zhang

[ Exhibit Hall I ]

Abstract
We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI to enhance inversion accuracy. To seamlessly integrate PerRFI with our backbone RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.
Poster
Yanzuo Lu · Yuxi Ren · Xin Xia · Shanchuan Lin · XING WANG · Xuefeng Xiao · Jinhua Ma · Xiaohua Xie · Jianhuang Lai

[ Exhibit Hall I ]

Abstract
Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose **Adversarial Distribution Matching (ADM)**, a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates a distributional loss on ODE pairs collected from the teacher model, thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed **DMDX**, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
Poster
Bowen Fu · Wei Wei · Jiaqi Tang · Jiangtao Nie · Yanyu Ye · Xiaogang Xu · Ying-Cong Chen · Lei Zhang

[ Exhibit Hall I ]

Abstract
Controllable diffusion models have been widely applied in image stylization. However, existing methods often treat the style in the reference image as a single, indivisible entity, which makes it difficult to transfer specific stylistic attributes. To address this issue, we propose a fine-grained controllable image stylization framework, Co-Painter, to decouple multiple attributes embedded in the reference image and adaptively inject them into the diffusion model. We first build a multi-condition image stylization framework based on the text-to-image generation model. Then, to drive it, we develop a fine-grained decoupling mechanism to implicitly separate the attributes from the image. Finally, we design a gated feature injection mechanism to adaptively regulate the importance of multiple attributes. To support the above procedure, we also build a dataset with fine-grained styles. It comprises nearly 48,000 image-text pair samples. Extensive experiments demonstrate that the proposed model achieves an optimal balance between text alignment and style similarity to reference images, both in standard and fine-grained settings.
Poster
Haowen Li · Zhenfeng Fan · Zhang Wen · Zhengzhou Zhu · Yunjin Li

[ Exhibit Hall I ]

Abstract
Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps after initial image blending without additional interference in the diffusion process. Our method uses a multilayer perceptron to integrate CLIP features from foreground and background images, manipulating diffusion steps with a cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. We also create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, reducing LPIPS scores by $30.5$\% and improving CSD metrics by $18.1$\%. We believe our method will advance future research and applications. The code and benchmark will be publicly available.
Poster
XINQI LYU · Yihao LIU · Yanjie Li · Bin Xiao

[ Exhibit Hall I ]

Abstract
Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.
Poster
Jingye Chen · Zhaowen Wang · Nanxuan Zhao · Li Zhang · Difan Liu · Jimei Yang · Qifeng Chen

[ Exhibit Hall I ]

Abstract
Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages: (1) reference creation, (2) design planning, and (3) layer generation. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such …
Poster
Jie Shao · Hanxiao Zhang · Hao Yu · Jianxin Wu

[ Exhibit Hall I ]

Abstract
The rapid progress in generative models has significantly enhanced the quality of image generation. However, as these models grow larger, deploying and fine-tuning them becomes increasingly challenging. While conventional quantization techniques help reduce model size, they struggle to achieve high compression rates without significant performance loss. As a result, memory footprint remains a critical challenge for generative models. In this work, we explore the extreme compression of generative models through codebook quantization, drastically reducing model size while maintaining performance. We extend product quantization for model compression, significantly increasing codebook capacity, which is crucial for preserving the generative quality of diffusion models. We also introduce a codebook compression method for memory efficiency. To further minimize performance degradation, we develop EM calibration with re-initialization that optimizes both assignments and centroids. By compressing the model to as low as 1 bit (achieving a 13$\times$ reduction in model size), we obtain a highly compact generative model with remarkable image quality. Extensive experiments on ImageNet demonstrate the superiority of our method over existing techniques. Furthermore, we validate its effectiveness across various generation, language and 3D tasks, highlighting its broad applicability and robust performance.
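A minimal sketch of the product-quantization idea referenced above, assuming weights are split into sub-vectors that are each replaced by the index of the nearest centroid in a shared codebook; the sub-vector size, codebook size, and plain k-means fitting are illustrative, and the paper's EM calibration with re-initialization is omitted:

```python
import numpy as np

def product_quantize(w: np.ndarray, sub_dim: int = 4, n_centroids: int = 256, iters: int = 10):
    """Compress a weight matrix by quantizing each sub-vector to a shared codebook."""
    sub = w.reshape(-1, sub_dim)                                    # split into sub-vectors
    codebook = sub[np.random.choice(len(sub), n_centroids, False)]  # random init
    for _ in range(iters):                                          # plain k-means
        d = ((sub[:, None, :] - codebook[None]) ** 2).sum(-1)       # squared distances
        assign = d.argmin(1)                                        # nearest centroid index
        for k in range(n_centroids):
            if (assign == k).any():
                codebook[k] = sub[assign == k].mean(0)
    return codebook, assign                                         # store these instead of w

def dequantize(codebook: np.ndarray, assign: np.ndarray, shape) -> np.ndarray:
    return codebook[assign].reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
cb, idx = product_quantize(w)
w_hat = dequantize(cb, idx, w.shape)   # approximate weights reconstructed at load time
```

Storing only the codebook and the per-sub-vector indices is what drives the compression; the quality then hinges on how the assignments and centroids are calibrated, which is where the paper's EM-based refinement comes in.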
Poster
Hengyu Meng · Duotun Wang · Zhijing Shao · Ligang Liu · Zeyu Wang

[ Exhibit Hall I ]

Abstract
Professional 3D asset creation often requires diverse sculpting brushes to add surface details and geometric structures. Despite recent progress in 3D generation, producing reusable sculpting brushes compatible with artists' workflows remains an open and challenging problem. These sculpting brushes are typically represented as vector displacement maps (VDMs), which existing models cannot easily generate compared to natural images. This paper presents Text2VDM, a novel framework for text-to-VDM brush generation through the deformation of a dense planar mesh guided by score distillation sampling (SDS). The original SDS loss is designed for generating full objects and struggles with generating desirable sub-object structures from scratch in brush generation. We refer to this issue as semantic coupling, which we address by introducing weighted blending of prompt tokens to SDS, resulting in a more accurate target distribution and semantic guidance. Experiments demonstrate that Text2VDM can generate diverse, high-quality VDM brushes for sculpting surface details and geometric structures. Our generated brushes can be seamlessly integrated into mainstream modeling software, enabling various applications such as mesh stylization and real-time interactive modeling.
Poster
Haonan Qiu · Shiwei Zhang · Yujie Wei · Ruihang Chu · Hangjie Yuan · Xiang Wang · Yingya Zhang · Ziwei Liu

[ Exhibit Hall I ]

Abstract
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns derived from accumulated errors. To tackle this challenge, we propose $\textbf{FreeScale}$, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the $\textbf{8k}$-resolution text-to-image generation for the first time.
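For intuition, a hedged sketch of what fusing two receptive scales by frequency components could look like, assuming the low-frequency band is taken from a globally processed signal and the high-frequency band from the native high-resolution signal; the cutoff, mask shape, and single-channel features are illustrative assumptions, not FreeScale's exact recipe:

```python
import numpy as np

def fuse_scales(global_feat: np.ndarray, local_feat: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Combine two receptive scales by splicing their frequency components."""
    h, w = global_feat.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = (np.abs(fy) < cutoff) & (np.abs(fx) < cutoff)    # low-frequency band
    fused_spec = np.where(low_mask,
                          np.fft.fft2(global_feat),             # coarse structure from the global scale
                          np.fft.fft2(local_feat))              # fine detail from the local scale
    return np.fft.ifft2(fused_spec).real

global_feat = np.random.randn(64, 64)   # toy single-channel feature maps
local_feat = np.random.randn(64, 64)
fused = fuse_scales(global_feat, local_feat)   # shape (64, 64)
```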
Poster
Yiren Song · Xiaokang Liu · Mike Zheng Shou

[ Exhibit Hall I ]

Abstract
Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.
Poster
Anlin Zheng · Haochen Wang · Yucheng Zhao · Weipeng DENG · Tiancai Wang · Xiangyu Zhang · Xiaojuan Qi

[ Exhibit Hall I ]

Abstract
The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code will be publicly …
Poster
Azim Ospanov · Mohammad Jalali · Farzan Farnia

[ Exhibit Hall I ]

Abstract
The use of CLIP embeddings to assess the alignment of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the relevance of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, which is responsible for generating diverse images from similar text prompts. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the *Schur Complement Entropy (SCE)* score, a measure of the intrinsic diversity of a text-to-image model based on data collected with varying text prompts. Additionally, we demonstrate the use of the Schur complement-based decomposition to nullify the influence of a given prompt in the CLIP embedding of an image, enabling focus or defocus of embeddings on specific objects or properties for downstream tasks. We …
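A hedged numerical sketch of the Schur-complement decomposition described above, assuming jointly collected CLIP image and text embeddings, a linear kernel, and a von Neumann-style matrix entropy over the normalized spectrum; these simplifications are illustrative stand-ins for the paper's kernel and normalization choices:

```python
import numpy as np

def schur_complement_entropy(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Entropy of the image covariance after removing the part explained by text."""
    n = len(img_emb)
    Cii = img_emb.T @ img_emb / n                  # image-image covariance (linear kernel)
    Ctt = txt_emb.T @ txt_emb / n                  # text-text covariance
    Cit = img_emb.T @ txt_emb / n                  # image-text cross covariance
    # Schur complement: image variation not accounted for by the text component
    S = Cii - Cit @ np.linalg.pinv(Ctt) @ Cit.T
    eig = np.linalg.eigvalsh(S)
    p = np.clip(eig, 0, None)
    p = p / p.sum()                                # normalize spectrum to a distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())           # matrix-based (von Neumann) entropy

img_emb = np.random.randn(200, 64)   # toy CLIP image embeddings (n_samples x dim)
txt_emb = np.random.randn(200, 64)   # toy CLIP text embeddings
print(schur_complement_entropy(img_emb, txt_emb))
```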
Poster
Xuan Han · Yihao Zhao · Yanhao Ge · Mingyu You

[ Exhibit Hall I ]

Abstract
With its extensive applications, Foreground Conditioned Out-painting (FCO) has attracted considerable attention in the research field. Through text-driven FCO, users can generate diverse backgrounds for a given foreground by adjusting the text prompt, which considerably enhances efficiency in fields like e-commerce. Since the foreground is fixed in FCO, a key concern is whether the generated background can match the foreground well to achieve a coherent composition. However, most existing methods are lacking in this regard. Artifacts and incorrect interactions are common defects in synthesized images. This issue is linked to the influence of the initial noise in the sampling process. As the initial noise is sampled independently, it is highly likely that the implied image composition will conflict with the given foreground. In this paper, a novel Initialization Policy Model (IPM) is proposed to address this problem. Its function is to replace the early denoising steps and directly predict the intermediate state that is conducive to a reasonable image composition. Since the IPM is designed to take only the foreground image and the text prompt as inputs, it isolates the impact of the initial noise. The subsequently proposed training paradigm that combines inversion-derived label supervision …
Poster
Pedro Vélez · Luisa Polania Cabrera · Yi Yang · Chuhan Zhang · Rishabh Kabra · Anurag Arnab · Mehdi S. M. Sajjadi

[ Exhibit Hall I ]

Abstract
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we analyze the performance of latent image and video diffusion representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. For the most informative comparison, we utilize the same model architecture, WALT, across image and video generation objectives. Our results show that video generation pre-training consistently outperforms its image counterpart, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
Poster
Zhe Ma · Qingming Li · Xuhong Zhang · Tianyu Du · Ruixiao Lin · Zonghui Wang · Shouling Ji · Wenzhi CHEN

[ Exhibit Hall I ]

Abstract
The past few years have witnessed substantial advances in image generation powered by diffusion models. However, it was shown that diffusion models are susceptible to training data memorization, raising significant concerns regarding copyright infringement and privacy invasion. This study delves into a rigorous analysis of memorization in diffusion models. We introduce InvMM, an inversion-based measure of memorization, which is based on inverting a sensitive latent noise distribution that accounts for the replication of an image. For accurate estimation of the measure, we propose an adaptive algorithm that balances the normality and sensitivity of the noise distribution. Comprehensive experiments across four datasets, conducted on both unconditional and text-guided diffusion models, demonstrate that InvMM provides a reliable and complete quantification of memorization. Notably, InvMM is commensurable between samples, reveals the true extent of memorization from an adversarial standpoint and implies how memorization differs from membership. In practice, it serves as an auditing tool for developers to reliably assess the risk of memorization, thereby contributing to the enhancement of trustworthiness and privacy-preserving capabilities of diffusion models.
Poster
Aysan Aghazadeh · Adriana Kovashka

[ Exhibit Hall I ]

Abstract
We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) methods and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. We show that current T2I models struggle with creativity, persuasiveness, and alignment when the input text conveys implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.
Poster
Zuhao Yang · Jiahui Zhang · Yingchen Yu · Shijian Lu · Song Bai

[ Exhibit Hall I ]

Abstract
Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high‑quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high‑fidelity, and semantic‑coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine‑tuning and representation alignment regularization to mitigate the limitations of pre‑trained image‑to‑video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.
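For intuition only, a minimal sketch of what an interpolation-based initialization for transition generation could look like, assuming the first- and last-frame latents are linearly blended per intermediate frame and then perturbed back toward a noisy starting state; the blending schedule, noise level, and latent shapes are placeholder assumptions, not VTG's published procedure:

```python
import numpy as np

def init_transition_latents(z_first: np.ndarray, z_last: np.ndarray,
                            n_frames: int, noise_std: float = 0.8) -> np.ndarray:
    """Initialize intermediate-frame latents by interpolating the two endpoint latents."""
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None, None, None]
    blended = (1.0 - alphas) * z_first + alphas * z_last    # per-frame linear blend
    noise = np.random.randn(*blended.shape) * noise_std     # push back to a noisy start
    return blended + noise

z_first = np.random.randn(4, 32, 32)   # toy latent of the first frame (C, H, W)
z_last = np.random.randn(4, 32, 32)    # toy latent of the last frame
z_init = init_transition_latents(z_first, z_last, n_frames=16)   # (16, 4, 32, 32)
```

Starting the sampler from such a blend, rather than from independent noise per frame, is one way an initialization can bias the trajectory toward preserving the two endpoint identities while still allowing content to change in between.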
Poster
JianHui Zhang · Shen Cheng · Qirui Sun · Jia Liu · Wang Luyang · chaoyu feng · Chen Fang · LEI LEI · Jue Wang · Shuaicheng Liu

[ Exhibit Hall I ]

Abstract
In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment—two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter: learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency. (2) Reference Patch Adapter: implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and photo-concept-bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence. The code will be open-sourced.
Poster
Shengbang Tong · David Fan · Jiachen Zhu · Yunyang Xiong · Xinlei Chen · Koustuv Sinha · Michael Rabbat · Yann LeCun · Saining Xie · Zhuang Liu

[ Exhibit Hall I ]

Abstract
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
Poster
Quang-Binh Nguyen · Minh Luu · Quang Nguyen · Anh Tran · Khoi Nguyen

[ Exhibit Hall I ]

Abstract
Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance on par with diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) Augmented Key-Value (K-V) memory enhancing content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
Poster
Runtao Liu · I Chen · Jindong Gu · Jipeng Zhang · Renjie Pi · Qifeng Chen · Philip Torr · Ashkan Khakzar · Fabio Pizzati

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. AlignGuard consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. We will release code and models.
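For intuition only, a naive weighted LoRA merge is sketched below; the abstract describes the paper's merging strategy as novel and it is not reproduced here, so the per-expert weights, ranks, and the optional re-factorization are purely illustrative assumptions:

```python
import numpy as np

def merge_lora_experts(experts, weights):
    """Naive merge: weighted sum of each expert's low-rank update A @ B."""
    d_out = experts[0][0].shape[0]
    d_in = experts[0][1].shape[1]
    merged = np.zeros((d_out, d_in))
    for (A, B), w in zip(experts, weights):
        merged += w * (A @ B)          # accumulate the full-rank equivalent of each expert
    return merged                      # could be re-factorized back to low rank via SVD

rank, d = 8, 64
experts = [(np.random.randn(d, rank), np.random.randn(rank, d)) for _ in range(3)]
delta_w = merge_lora_experts(experts, weights=[0.4, 0.3, 0.3])   # added to the base weight
```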
Poster
Zhuoling Li · Haoxuan Qu · Jason Kuen · Jiuxiang Gu · Qiuhong Ke · Jun Liu · Hossein Rahmani

[ Exhibit Hall I ]

Abstract
Intellectual property (IP) protection for diffusion models is a critical concern, given the significant resources and time required for their development. To effectively safeguard the IP of diffusion models, a key step is enabling the comparison of unique identifiers (fingerprints) between suspect and victim models. However, performing robust and effective fingerprint comparisons among diffusion models remains an under-explored challenge, particularly for diffusion models that have already been released. To address this, in this work, we propose \textbf{DiffIP}, a novel framework for robust and effective fingerprint comparison between suspect and victim diffusion models. Extensive experiments demonstrate the efficacy of our framework.
Poster
Yuxuan Wang · Tianwei Cao · Huayu Zhang · Zhongjiang He · Kongming Liang · Zhanyu Ma

[ Exhibit Hall I ]

Abstract
Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
Poster
Ryan Ramos · Vladan Stojnić · Giorgos Kordopatis-Zilos · Yuta Nakashima · Giorgos Tolias · Noa Garcia

[ Exhibit Hall I ]

Abstract
Prior work has analyzed the robustness of deep models to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact—either positively or negatively—on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels.
Poster
Yilin Wang · Zunlei Feng · Jiachi Wang · Hengrui Lou · Binjia Zhou · Jie Lei · Mingli Song · Yijun Bei

[ Exhibit Hall I ]

Abstract
The rapid development of AIGC technology has enabled highly realistic forged images to deceive human perception, posing serious risks across many areas. Current deepfake image detection methods primarily identify forgeries by extracting handcrafted features, deep features, and frequency-domain features. While these features contain forgery traces, they also include a substantial amount of the image's semantic information, which interferes with the precision and generalization of forgery detection models. To tackle these challenges, this paper introduces a novel forgery image identification method based on the Spatial-Temporal Forgery Trace (STFT). Motivated by the fact that forgery images are more easily fitted to a specific distribution than real images, the STFT method approaches the issue from a forged image distribution modeling perspective, employing generative diffusion models to meticulously capture the temporal distribution of images. It further models the relationship between temporal feature variations and spatially corresponding temporal features, treating them as temporal and spatial forgery traces. Moreover, STFT incorporates frequency-domain features as weighting factors to accelerate the localization of spatio-temporal forgery traces. Experiments demonstrate that by integrating spatial, temporal, and frequency perspectives within the latent space, STFT effectively captures subtle spatio-temporal forgery traces, exhibiting strong robustness and generalizability. It outperforms state-of-the-art methods on major …
Poster
Siyoon Jin · Jisu Nam · Jiyoung Kim · Dahyun Chung · Yeong-Seok Kim · Joonhyung Park · HeonJeong Chu · Seungryong Kim

[ Exhibit Hall I ]

Abstract
Exemplar-based semantic image synthesis generates images aligned with semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models, such as ControlNet, are limited as they rely solely on text prompts to control appearance and cannot utilize exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground object appearance transfer, struggling with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) to address exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input, and alternatively provides controllability to map user-defined object details to specific locations in the synthesized images. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation and matching, we adopt stage-wise training. We first train the structure-guidance and generation networks, followed by training the matching adapter while keeping the others frozen. During inference, we introduce an automated exemplar retrieval method for selecting exemplar image-segmentation pairs efficiently. Despite utilizing minimal learnable parameters, AM-Adapter achieves …
Poster
Zhikai Chen · Fuchen Long · Zhaofan Qiu · Ting Yao · Wengang Zhou · Jiebo Luo · Tao Mei

[ Exhibit Hall I ]

Abstract
Recent advances in video generation have demonstrated the utility of powerful diffusion models. One important direction among them is to enhance the visual quality of the AI-synthesized videos for artistic creation. Nevertheless, solely relying on the knowledge embedded in the pre-trained video diffusion models might limit the generalization ability of local details (e.g., texture). In this paper, we address this issue by exploring the visual cues from a high-quality (HQ) image reference to facilitate visual details generation in video enhancement. We present GenVE, a new recipe of generative video enhancement framework that pursues the semantic and texture alignment between HQ image reference and denoised video in diffusion. Technically, GenVE first leverages an image diffusion model to magnify a key frame of the input video to attain a semantics-aligned HQ image reference. Then, a video controller is integrated into 3D-UNet to capture patch-level texture of the image reference to enhance fine-grained details generation at the corresponding region of low-quality (LQ) video. Moreover, a series of conditioning augmentation strategies are implemented for effective model training and algorithm robustness. Extensive experiments conducted on the public YouHQ40 and VideoLQ, as well as self-built AIGC-Vid dataset, quantitatively and qualitatively demonstrate the efficacy of our GenVE …
Poster
Yingsong Huang · Hui Guo · Jing Huang · Bing Bai · Qi Xiong

[ Exhibit Hall I ]

Abstract
The rapid progress of diffusion models highlights the growing need for detecting generated images. Previous research demonstrates that incorporating diffusion-based measurements, such as reconstruction error, can enhance the generalizability of detectors. However, ignoring the differing impacts of aleatoric and epistemic uncertainty on reconstruction error can undermine detection performance. Aleatoric uncertainty, arising from inherent data noise, creates ambiguity that impedes accurate detection of generated images. As it reflects random variations within the data (e.g., noise in natural textures), it does not help distinguish generated images. In contrast, epistemic uncertainty, which represents the model's lack of knowledge about unfamiliar patterns, supports detection. In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. We introduce Diffusion Epistemic Uncertainty (DEU) estimation via the Laplace approximation to assess the proximity of data to the manifold of diffusion-generated samples. Additionally, an asymmetric loss function is introduced to train a balanced classifier with larger margins, further enhancing generalizability. Extensive experiments on large-scale benchmarks validate the state-of-the-art performance of our method.
Poster
Lingxiao Li · Kaixuan Fan · Boqing Gong · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.
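The claim that semantic diversity can be controlled by adjusting radii within the hyperbolic disk can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch interpretation, not the authors' implementation; the function name, clamping threshold, and scaling scheme are assumptions.

```python
import torch


def scale_poincare_radius(code, scale, max_norm=0.999):
    """Hypothetical sketch: rescale the hyperbolic radius of a semantic code in
    the Poincaré ball. Codes near the boundary are more specific; shrinking the
    radius moves toward more generic representations, widening diversity."""
    norm = code.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    new_norm = (norm * scale).clamp(max=max_norm)  # stay strictly inside the ball
    return code / norm * new_norm
```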
Poster
Rui Xie · Yinhong Liu · Penghao Zhou · Chen Zhao · Jun Zhou · Kai Zhang · Zhenyu Zhang · Jian Yang · Zhenheng Yang · Ying Tai

[ Exhibit Hall I ]

Abstract
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. A straightforward remedy is to integrate text-to-video (T2V) models into video super-resolution for improved temporal modeling. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
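The Dynamic Frequency (DF) Loss is described as guiding the model to focus on different frequency components across diffusion steps. The sketch below is one hypothetical PyTorch reading of that idea: noisier steps weight low-frequency fidelity more heavily, cleaner steps weight high frequencies. The cutoff, weighting schedule, and function name are assumptions, not the authors' loss.

```python
import torch
import torch.fft


def dynamic_frequency_loss(pred, target, t, t_max, cutoff=0.25):
    """Hypothetical frequency-weighted fidelity loss.

    pred, target: (B, C, H, W) tensors; t: current diffusion step (0 = clean).
    Early (large t) steps emphasize low frequencies, late steps high frequencies.
    """
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")

    # Low-frequency mask built from the FFT sample frequencies
    _, _, h, w = pred.shape
    fy = torch.fft.fftfreq(h, device=pred.device).abs()
    fx = torch.fft.fftfreq(w, device=pred.device).abs()
    low_mask = ((fy[:, None] < cutoff) & (fx[None, :] < cutoff)).float()

    # Step-dependent weights: noisy steps -> low freq, clean steps -> high freq
    w_low = t / t_max
    w_high = 1.0 - w_low

    err = (pred_f - target_f).abs() ** 2
    loss_low = (err * low_mask).mean()
    loss_high = (err * (1.0 - low_mask)).mean()
    return w_low * loss_low + w_high * loss_high
```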
Poster
Jingwei Liu · Ling Yang · Hao Luo · Fan Wang · Hongyan Li · Mengdi Wang

[ Exhibit Hall I ]

Abstract
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited LLM context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models.
Poster
Zhuokun Chen · Jugang Fan · Zhuowei Yu · Bohan Zhuang · Mingkui Tan

[ Exhibit Hall I ]

Abstract
Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number of tokens involved. In this paper, we introduce SparseVAR, a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference without requiring additional training. Our approach is motivated by the observation that tokens in low-frequency regions have a negligible impact on image quality in high-resolution stages and exhibit strong similarity with neighboring tokens. Additionally, we observe that different blocks in the next-scale prediction model focus on distinct regions, with some concentrating on high-frequency areas. SparseVAR leverages these insights by employing lightweight MSE-based metrics to identify low-frequency tokens while preserving the fidelity of excluded regions through a small set of uniformly sampled anchor tokens. By significantly reducing the computational cost while maintaining high image generation quality, SparseVAR achieves notable acceleration in both HART and Infinity. Specifically, SparseVAR achieves up to a 2× speedup with minimal quality degradation in Infinity-2B.
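A minimal sketch of the token-selection idea described above: a lightweight MSE score against the local neighborhood marks low-frequency tokens for exclusion, while uniformly sampled anchor tokens are always retained to preserve the excluded regions. This is a hypothetical PyTorch illustration; the threshold, anchor stride, and function name are assumptions.

```python
import torch
import torch.nn.functional as F


def select_tokens(tokens, grid_hw, mse_thresh=0.01, anchor_stride=4):
    """Hypothetical sketch: drop low-frequency tokens, keep uniform anchors.

    tokens: (B, N, D) token features on an (H, W) grid with N = H * W.
    A token is 'low-frequency' if its MSE to the local neighborhood mean is small.
    """
    b, n, d = tokens.shape
    h, w = grid_hw
    x = tokens.transpose(1, 2).reshape(b, d, h, w)

    # Local neighborhood mean via 3x3 average pooling (same spatial size)
    local_mean = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
    mse = ((x - local_mean) ** 2).mean(dim=1).reshape(b, n)  # (B, N)

    keep = mse > mse_thresh  # high-frequency tokens are kept

    # Uniformly sampled anchor tokens are always kept
    anchor = torch.zeros(h, w, dtype=torch.bool, device=tokens.device)
    anchor[::anchor_stride, ::anchor_stride] = True
    keep |= anchor.reshape(1, n)

    return keep  # boolean mask (B, N) of tokens to run through the model
```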
Poster
Xuran Ma · Yexin Liu · Yaofu LIU · Xianfeng Wu · Mingzhe Zheng · Zihao Wang · Ser-Nam Lim · Harry Yang

[ Exhibit Hall I ]

Abstract
Video generation using diffusion models has shown remarkable progress, yet it remains computationally expensive due to the repeated processing of redundant features across blocks and steps. To address this, we propose a novel adaptive feature reuse mechanism that dynamically identifies and caches the most informative features, focusing computation on the foreground and caching more aggressively in the background, significantly reducing computational overhead while sacrificing little video quality. By leveraging both step- and block-level caching, our method achieves up to a 1.8× speedup on HunyuanVideo while maintaining competitive performance on VBench, PSNR, SSIM, FID, and LPIPS. Extensive experiments demonstrate that our approach not only improves efficiency but also enhances the quality of generated videos. The proposed method is generalizable and can be integrated into existing diffusion transformer frameworks.
Poster
Tsu-Jui Fu · Yusu Qian · Chen Chen · Wenze Hu · Zhe Gan · Yinfei Yang

[ Exhibit Hall I ]

Abstract
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
Poster
Hyungjin Kim · Seokho Ahn · Young-Duk Seo

[ Exhibit Hall I ]

Abstract
Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.
Poster
Carlos Esteves · Mohammed Suhail · Ameesh Makadia

[ Exhibit Hall I ]

Abstract
Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages that natural images are more compressible at high frequencies, 2) it can take and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction -- instead of conditioning on a partial line-by-line reconstruction of the image, it takes a coarse reconstruction of the full image, 4) it enables partial decoding where the first few generated tokens can reconstruct a coarse version of the image, 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer reconstruction metrics as well as multiscale image generation, text-guided image upsampling and editing.
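A minimal, hypothetical NumPy sketch of a coarse-to-fine spectral ordering using a hand-rolled single-level Haar transform; the level count and normalization are assumptions, and the actual tokenizer additionally quantizes the coefficients into discrete tokens.

```python
import numpy as np


def haar_dwt2(x):
    """One level of a 2D Haar transform; x is a 2D array with even sides."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    return a, (h, v, d)


def coarse_to_fine_sequence(image, levels=3):
    """Order wavelet coefficients from the coarsest approximation to the finest
    details, so the earliest 'tokens' already describe a low-resolution image.
    Assumes image side lengths are divisible by 2**levels."""
    approx, details = image, []
    for _ in range(levels):
        approx, hvd = haar_dwt2(approx)
        details.append(hvd)
    sequence = [approx.ravel()]
    for hvd in reversed(details):          # coarsest detail bands first
        sequence.extend(band.ravel() for band in hvd)
    return np.concatenate(sequence)
```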
Poster
Zeyinzi Jiang · Zhen Han · Chaojie Mao · Jingfeng Zhang · Yulin Pan · Yu Liu

[ Exhibit Hall I ]

Abstract
Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations.
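A minimal, hypothetical sketch of what a unified input interface in the spirit of the Video Condition Unit could look like; the field names and shapes are assumptions, not the authors' definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import torch


@dataclass
class VideoConditionUnit:
    """Hypothetical sketch of a VCU-like container that packs the inputs of
    different video tasks (editing, reference, masking) into one interface;
    unused fields stay at their defaults so a single model signature covers
    all tasks."""
    source_frames: Optional[torch.Tensor] = None      # (T, C, H, W) video to edit
    reference_images: List[torch.Tensor] = field(default_factory=list)
    masks: Optional[torch.Tensor] = None               # (T, 1, H, W) edit regions
    text_prompt: str = ""
```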
Poster
yifei feng · Mx Yang · Shuhui Yang · Sheng Zhang · Jiaao Yu · Zibo Zhao · Lliu Yuhong · Jie Jiang · Chunchao Guo

[ Exhibit Hall I ]

Abstract
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
Poster
Jiaqi Liao · Zhengyuan Yang · Linjie Li · Dianqi Li · Kevin Lin · Yu Cheng · Lijuan Wang

[ Exhibit Hall I ]

Abstract
In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. However, we observe that MLLMs often produce unstructured reasoning steps, resulting in suboptimal outcomes. To tackle this issue, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. However, due to the complexity of T2I-ICL tasks, there is still significant room for improvement. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain by varying the random seed. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks.
Poster
Eric Slyman · Mehrab Tanjim · Kushal Kafle · Stefan Lee

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called **M**ultimodal **M**ixture-of-**B**ayesian Prompt Ensembles (MMB). Our approach uses a Bayesian prompt ensemble augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
Poster
Jisoo Kim · Wooseok Seo · Junwan Kim · Seungho Park · Sooyeon Park · Youngjae Yu

[ Exhibit Hall I ]

Abstract
Despite the remarkable success of text-to-video (T2V) generation, its large memory requirements limit deployment in resource-constrained environments, leading to extensive research on model pruning and knowledge distillation to enhance efficiency while preserving performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT) loss, which, due to the reduced capacity of pruned models, struggles to capture fine-grained details. This leads to averaged predictions and ultimately degrades overall quality. To mitigate this challenge, we propose an effective distillation method, \loss, that combines DPO and SFT, leveraging DPO’s ability to guide the student model in learning preferences for its limiting properties while de-emphasizing less critical ones, complemented by SFT to enhance overall performance. Along with \loss, our framework, \ours, includes filtering and curation for high-quality datasets, as well as a step-by-step online approach for more effective learning. We implement our method on two baseline models, VideoCrafter2 and AnimateDiff, achieving parameter reductions of 36.2% in VideoCrafter and 67.5% in the AnimateDiff motion module, while maintaining or even surpassing the performance of the full models. Further experiments validate the effectiveness of our \loss loss and \ours framework, demonstrating their impact on efficient and high-quality video generation.
Poster
Renxi Cheng · Hongsong Wang · Yang Zhang · Chaolei Han · Jie Gui

[ Exhibit Hall I ]

Abstract
The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture the intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (11.9%$\uparrow$) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion …
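A minimal, hypothetical NumPy sketch of the two ingredients described above: rebuilding a noise image from the lowest bit planes, and selecting the patch with the largest multi-directional gradient score. The bit-plane choice, patch size, and gradient combination are assumptions rather than the authors' exact design.

```python
import numpy as np


def low_bitplane_noise(gray, planes=(0, 1)):
    """Hypothetical sketch: rebuild a 'noise image' from the lowest bit planes,
    which mostly carry noise patterns rather than semantic content."""
    gray = gray.astype(np.uint8)
    noise = np.zeros_like(gray, dtype=np.float32)
    for p in planes:
        noise += ((gray >> p) & 1).astype(np.float32) * (2 ** p)
    return noise / noise.max() if noise.max() > 0 else noise


def max_gradient_patch(noise, patch=64):
    """Select the patch whose multi-directional gradient magnitude is largest."""
    h, w = noise.shape
    assert h >= patch and w >= patch, "image must be at least one patch large"

    gy, gx = np.gradient(noise)                       # vertical / horizontal
    gdiag = np.abs(noise[1:, 1:] - noise[:-1, :-1])   # diagonal differences
    score_map = np.abs(gx) + np.abs(gy)
    score_map[1:, 1:] += gdiag

    best, best_score = (0, 0), -1.0
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            s = score_map[y:y + patch, x:x + patch].sum()
            if s > best_score:
                best, best_score = (y, x), s
    y, x = best
    return noise[y:y + patch, x:x + patch]
```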
Poster
Lennart Bastian · Mohammad Rashed · Nassir Navab · Tolga Birdal

[ Exhibit Hall I ]

Abstract
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings.
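A minimal, hypothetical SciPy sketch of Savitzky-Golay smoothing of noisy rotations in the tangent space of $SO(3)$; it ignores wrap-around near $\pi$ and only illustrates the smoothing-path idea, not the Neural Controlled Differential Equation model itself. The function name and filter parameters are assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.spatial.transform import Rotation as R


def smooth_so3_trajectory(quats, window=9, polyorder=3):
    """Hypothetical Savitzky-Golay smoothing of noisy rotations.

    quats: (N, 4) array of unit quaternions (x, y, z, w), noisy pose estimates.
    Rotations are mapped to rotation vectors (tangent coordinates), smoothed
    with a Savitzky-Golay filter along the trajectory, and mapped back to SO(3).
    This simplified version ignores wrap-around near pi.
    """
    rotvecs = R.from_quat(quats).as_rotvec()           # (N, 3) tangent coords
    smoothed = savgol_filter(rotvecs, window, polyorder, axis=0)
    return R.from_rotvec(smoothed)                      # smoothed Rotation object
```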
Poster
Zichen Liu · Yihao Meng · Hao Ouyang · Yue Yu · Bolin Zhao · Daniel Cohen-Or · Huamin Qu

[ Exhibit Hall I ]

Abstract
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The animation is represented by a canonical field that aggregates the semantic content in a canonical shape and a deformation field that applies per-frame motion to deform the canonical shape. The two fields are jointly optimized using priors from a large pretrained text-to-video diffusion model with a score-distillation loss and designed regularization, encouraging coherence between the video and the intended textual concept while maintaining legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our methodology over baselines. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability.
Poster
Sicheng Zhang · Binzhu Xie · Zhonghao Yan · Yuli Zhang · Donghao Zhou · Xiaofei Chen · Shi Qiu · Jiaqi Liu · Guoyang Xie · Zhichao Lu

[ Exhibit Hall I ]

Abstract
Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models’ complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To address this gap, we introduce **TRIG-Bench** (**Tr**ade-offs in **I**mage **G**eneration), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity and Bias), contains over 40,200 samples, and covers 132 **Pairwise Dimensional Subsets.** Furthermore, we develop **TRIGScore,** a VLM-as-judge metric that automatically adapts to various dimensions. Based on this, we evaluate 14 cutting-edge models across T2I and I2I tasks. In addition, we propose the Relation Recognition System and generate the Dimension Trade-off Map (**DTM**), which visualizes model-specific capability trade-offs. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, after fine-tuning on DTM, the model's dimension-specific impact is mitigated, and overall performance is enhanced.
Poster
Zeyi Sun · Ziyang Chu · Pan Zhang · Tong Wu · Xiaoyi Dong · Yuhang Zang · Yuanjun Xiong · Dahua Lin · Jiaqi Wang

[ Exhibit Hall I ]

Abstract
Recent advances in large language models have enabled task prompting for open-ended text generation. In the vision domain, a longstanding goal is developing models capable of general visual learning, encompassing tasks such as image generation, editing, low-level processing, and dense perception. Although recent efforts have aimed at building vision foundation models that support prompting, significant challenges remain, particularly in accurately comprehending visual prompts and addressing the ambiguity inherent in textual prompts. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed for generalizable visual learning via in-context prompting. X-Prompt can process visual and textual prompts as context, enabling precise task interpretation and accurate execution. A novel prompt-token fusion mechanism effectively extracts relevant task information from complex prompts while significantly reducing the token length. Additionally, a unified training strategy for text and image prediction enhances task awareness, enabling seamless adaptation to open-ended prompts. Extensive experiments demonstrate that X-Prompt effectively interprets in-context prompts and exhibits generalization across both in-domain and out-of-domain visual tasks, paving the way for future advancements in general visual learning.
Poster
Yuwei Guo · Ceyuan Yang · Ziyan Yang · Zhibei Ma · Zhijie Lin · Zhenheng Yang · Dahua Lin · Lu Jiang

[ Exhibit Hall I ]

Abstract
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and semantic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that extends the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including composable generation and interactive shot extension, paving the way for more practical visual content creation.
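A minimal, hypothetical sketch of the attention pattern implied by scene-level context: full attention across all shots for joint generation, or context-causal attention across shots for auto-regressive extension. The token bookkeeping and function name are assumptions; positional embeddings and noise scheduling are omitted.

```python
import torch


def scene_attention_mask(tokens_per_shot, causal_across_shots=False):
    """Hypothetical sketch of a scene-level attention mask.

    tokens_per_shot: list of token counts, one per shot in the scene.
    Returns a boolean (N, N) mask where True means "may attend".
    Bidirectional mode lets every shot attend to every shot (joint generation);
    the causal variant only allows attention to the same or earlier shots,
    which is what makes KV-cached auto-regressive shot extension possible.
    """
    n = sum(tokens_per_shot)
    shot_id = torch.cat([
        torch.full((t,), i, dtype=torch.long) for i, t in enumerate(tokens_per_shot)
    ])
    if causal_across_shots:
        mask = shot_id[:, None] >= shot_id[None, :]   # earlier or current shots only
    else:
        mask = torch.ones(n, n, dtype=torch.bool)     # full attention over the scene
    return mask
```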
Poster
Junjia Huang · Pengxiang Yan · Jiyang Liu · Jie Wu · Zhao Wang · Yitong Wang · Liang Lin · Guanbin Li

[ Exhibit Hall I ]

Abstract
Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
Poster
Ziye Li · Xincheng Shuai · Hao Luo · Henghui Ding

[ Exhibit Hall I ]

Abstract
Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves state-of-the-art performance, marking a significant advancement in spatial- and motion-controlled video generation.
Poster
Jiarui Wang · Huiyu Duan · Yu Zhao · Juntong Wang · Guangtao Zhai · Xiongkuo Min

[ Exhibit Hall I ]

Abstract
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perceptual quality, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, demonstrating the generality of both the EvalMi-50K dataset and the LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released upon publication.
Poster
Pengzhen Chen · Yanwei Liu · Xiaoyan Gu · Enci Liu · Zhuoyi Shang · Xiangyang Ji · Wu Liu

[ Exhibit Hall I ]

Abstract
Diffusion models have significantly advanced the field of image synthesis, making the protection of their intellectual property (IP) a critical concern. Existing IP protection methods primarily focus on embedding watermarks into generated images by altering the structure of the diffusion process. However, these approaches inevitably compromise the quality of the generated images and are particularly vulnerable to fine-tuning attacks, especially for open-source models such as Stable Diffusion (SD). In this paper, we propose PlugMark, a novel plug-in zero-watermarking framework for diffusion models. The core idea of PlugMark is based on two observations: a classifier can be uniquely characterized by its decision boundaries, and a diffusion model can be uniquely represented by the knowledge acquired from training data. Building on this foundation, we introduce a diffusion knowledge extractor that can be plugged into a diffusion model to extract its knowledge and output a classification result. PlugMark subsequently generates boundary representations based on this classification result, serving as a zero-distortion watermark that uniquely represents the decision boundaries and, by extension, the knowledge of the diffusion model. Since only the extractor requires training, the performance of the original diffusion model remains unaffected. Extensive experimental results demonstrate that PlugMark can robustly extract high-confidence zero-watermarks from both …
Poster
Yongsheng Yu · Ziyun Zeng · Haitian Zheng · Jiebo Luo

[ Exhibit Hall I ]

Abstract
Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing.
Poster
Yuxin Jiang · Liming Jiang · Shuai Yang · Jia-Wei Liu · Ivor Tsang · Mike Zheng Shou

[ Exhibit Hall I ]

Abstract
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments. Code and models will be released.
Poster
Ge Gao · Siyue Teng · Tianhao Peng · Fan Zhang · David Bull

[ Exhibit Hall I ]

Abstract
While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a **G**enerative **I**mplicit **Vi**deo **C**ompression framework, **GIViC**, aiming at advancing the performance limits of this type of coding method. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting *long-term dependencies*. Through the newly designed *implicit diffusion* process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel **Hierarchical Gated Linear Attention-based transformer** (HGLA) is also integrated into the framework, dual-factorizing global dependency modeling along the scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the **first INR-based video codec that outperforms VTM based on the RA coding configuration**. The source code will be made available.
Poster
Peyman Gholami · Robert Xiao

[ Exhibit Hall I ]

Abstract
Denoising diffusion models have emerged as powerful tools for image manipulation, yet interactive, localized editing workflows remain underdeveloped. We introduce Layered Diffusion Brushes (LDB), a novel framework that facilitates real-time and iterative image editing with fine-grained, region-specific control. LDB leverages a unique approach that caches intermediate latent states within the diffusion process, enabling users to apply prompt-guided edits via masks in a non-destructive, layered manner. Key innovations include latent caching for significant speed enhancements (achieving edits in under 140ms on consumer GPUs) and redefining layering for diffusion models with an order-agnostic system that allows for independent manipulation and stacking of edits, even in overlapping regions. An editor implementing LDB, incorporating familiar layer concepts, was evaluated through user study and quantitative metrics. Results demonstrate LDB's superior speed alongside comparable or improved image quality, background preservation, and edit fidelity relative to existing state-of-the-art techniques across various sequential image manipulation tasks. The findings highlight LDB's potential to significantly enhance creative workflows by providing an intuitive and efficient approach to diffusion-based image editing and its potential for expansion into related subdomains, such as video editing.
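A minimal, hypothetical sketch of mask-constrained editing from cached intermediate latents; `denoise_step` is an assumed callable standing in for one denoising step of a diffusion model, and the blending scheme is illustrative rather than the authors' implementation.

```python
import torch


def edit_with_cached_latents(latent_cache, denoise_step, edit_prompt_emb,
                             mask, start_step):
    """Hypothetical sketch of cache-based, layered editing.

    latent_cache: dict {step: latent tensor} stored during the original run.
    denoise_step: assumed callable (latent, step, prompt_emb) -> latent.
    mask: (1, 1, H, W) tensor, 1 inside the region to edit.
    Only the masked region is re-denoised from a cached intermediate state,
    keeping the rest of the image untouched (non-destructive editing).
    """
    latent = latent_cache[start_step].clone()
    for step in range(start_step, -1, -1):
        edited = denoise_step(latent, step, edit_prompt_emb)
        original = latent_cache.get(step - 1, edited)
        # Blend: edited content inside the mask, cached original elsewhere
        latent = mask * edited + (1.0 - mask) * original
    return latent
```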
Poster
Tianyu Zhang · Xin Luo · Li Li · Dong Liu

[ Exhibit Hall I ]

Abstract
Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoders and decoders, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak dataset demonstrate that StableCodec outperforms existing methods in terms of FID, KID and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable …
Poster
Peng Zheng · Junke Wang · Yi Chang · Yizhou Yu · Rui Ma · Zuxuan Wu

[ Exhibit Hall I ]

Abstract
Recent advances in large language models (LLMs) have spurred interests in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored to autoregressively predict continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces $\textbf{DisCon}$ (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of $\textbf{1.38}$ on ImageNet 256$\times$256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
Poster
ZiYi Dong · Chengxing Zhou · Weijian Deng · Pengxu Wei · Xiangyang Ji · Liang Lin

[ Exhibit Hall I ]

Abstract
Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose \(\Delta\)ConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (\(\Delta\)ConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, \(\Delta\)ConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency—all without compromising generative fidelity.
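A minimal, hypothetical PyTorch sketch of a pyramid-of-depthwise-convolutions block of the kind that could stand in for a localized self-attention layer; the kernel size, dilations, and class name are assumptions, not the paper's \(\Delta\)ConvBlock definition.

```python
import torch
import torch.nn as nn


class PyramidConvBlock(nn.Module):
    """Hypothetical pyramid convolution block meant to replace a localized
    self-attention layer (sizes and structure are illustrative)."""

    def __init__(self, channels, kernel=3, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel, padding=d * (kernel // 2),
                      dilation=d, groups=channels)      # depthwise, multi-scale
            for d in dilations
        ])
        self.proj = nn.Conv2d(channels, channels, 1)     # pointwise channel mixing

    def forward(self, x):                                # x: (B, C, H, W)
        out = sum(branch(x) for branch in self.branches)
        return self.proj(out)
```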
Poster
Hongwei Yu · Xinlong Ding · Jiawei Li · Jinlong Wang · Yudong Zhang · Rongquan Wang · Huimin Ma · Jiansheng Chen

[ Exhibit Hall I ]

Abstract
While image conditional diffusion models demonstrate impressive generation capabilities, they exhibit high vulnerability when facing backdoor and adversarial attacks. In this paper, we define a scenario named diffusion anomaly where generated results of a reverse process under attack deviate significantly from the normal ones. By analyzing the underlying formation mechanism of the diffusion anomaly, we reveal how perturbations are amplified during the reverse process and accumulated in the results. Based on the analysis, we reveal the phenomena of divergence and homogeneity, which cause the diffusion process to deviate significantly from the normal process and to decline in diversity. Leveraging these two phenomena, we propose a method named Diffusion Anomaly Detection (DADet) to effectively detect both backdoor and adversarial attacks. Extensive experiments demonstrate that our proposal achieves excellent defense performance against backdoor and adversarial attacks. Specifically, for the backdoor attack detection, our method achieves an F1 score of 99\% on different datasets including MS COCO and CIFAR-10. For the detection of adversarial samples, the F1 score exceeds 84\% across three adversarial attacks and two different tasks, evaluated on the MS COCO and Places365 datasets respectively.
Poster
Junho Lee · Jeongwoo Shin · Hyungwook Choi · Joonseok Lee

[ Exhibit Hall I ]

Abstract
Despite the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), which take advantage of the hierarchical features maintained by a Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs).
Poster
Yukai Shi · Jiarong Ou · Rui Chen · Haotian Yang · Jiahao Wang · Xin Tao · Pengfei Wan · Di ZHANG · Kun Gai

[ Exhibit Hall I ]

Abstract
In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors behind poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. On our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few code changes.
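A minimal, hypothetical sketch of an online concept-wise equalization loss: concepts that have occurred frequently are down-weighted so rare concepts contribute comparably to the gradient. The class name, smoothing term, and normalization are assumptions, not the IMBA loss definition.

```python
import torch
from collections import Counter


class ConceptEqualizedLoss:
    """Hypothetical online concept-wise reweighting of a per-sample loss."""

    def __init__(self, smoothing=1.0):
        self.counts = Counter()       # running concept frequencies (online)
        self.smoothing = smoothing

    def __call__(self, per_sample_loss, concepts):
        # per_sample_loss: (B,) tensor; concepts: list of concept ids, one per sample
        self.counts.update(concepts)
        weights = torch.tensor(
            [1.0 / (self.counts[c] + self.smoothing) for c in concepts],
            device=per_sample_loss.device, dtype=per_sample_loss.dtype,
        )
        weights = weights / weights.mean()    # keep the overall loss scale stable
        return (weights * per_sample_loss).mean()
```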
Poster
Junyuan Zhang · Qintong Zhang · Bin Wang · Linke Ouyang · Zichen Wen · Ying Li · Ka-Ho Chow · Conghui He · Wentao Zhang

[ Exhibit Hall I ]

Abstract
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various forms of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q&A pairs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the trend relationship between the degree of OCR noise and RAG …
Poster
Yang Liu · Xudong Xie · Yuliang Liu · Xiang Bai

[ Exhibit Hall I ]

Abstract
Overlapping text poses significant challenges for text-related perception tasks, particularly in open scenes characterized by diverse fonts and visual effects. While existing research has primarily addressed the overlapping problem in documents, its applicability to other scenes remains limited. To bridge this gap, we propose a new task of multi-scenario overlapping text segmentation and introduce a corresponding real dataset in both English and Chinese, spanning various contexts such as printed text, bills, artistic designs, and house numbers. To further enhance the generalization of overlapping text segmentation models, we propose a hierarchical training data synthesis strategy that simulates diverse overlapping patterns across different scenarios. Furthermore, we found that depth maps can provide clear relative position relationships in three-dimensional space, assisting the model in capturing complex overlapping relationships between text instances. Building on this insight, we present a depth-guided decoder that seamlessly integrates image and depth features to capture overlapping interactions. Our proposed model achieves a 5.3% improvement in text mIoU and a 6.4% improvement in overall mIoU compared to existing SOTA methods on our benchmark and SignaTR6k datasets, respectively.
Poster
Xingsong Ye · Yongkun Du · Yunbo Tao · Zhineng Chen

[ Exhibit Hall I ]

Abstract
Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available in Supplementary Materials.
Poster
Zexuan Yan · Yue Ma · Chang Zou · Wenteng Chen · Qifeng Chen · Linfeng Zhang

[ Exhibit Hall I ]

Abstract
Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we observe that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion progress. To tackle these challenges, we propose a practical framework, named **EEdit**, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. **For spatial redundancy**, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. **For temporal redundancy**, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of **2.46**$\times$ acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging, and image composition. Our codes are available in the supplementary material and will be released on GitHub.
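A minimal, hypothetical PyTorch sketch of token-index preprocessing for spatial locality caching: only tokens covering the edited region, plus a ring of neighboring tokens, are marked for recomputation, while the rest can reuse cached features. Patch size and dilation are assumptions.

```python
import torch
import torch.nn.functional as F


def active_token_indices(edit_mask, patch=16, dilate=1):
    """Hypothetical token-index preprocessing for locality-aware caching.

    edit_mask: (H, W) binary tensor marking the edited pixels; H and W are
    assumed to be multiples of `patch`. Returns flat indices of tokens that
    must be recomputed; all other tokens can reuse cached features.
    """
    token_mask = F.max_pool2d(edit_mask[None, None].float(), patch).squeeze()  # (H/p, W/p)
    if dilate > 0:
        k = 2 * dilate + 1
        token_mask = F.max_pool2d(token_mask[None, None], k, stride=1,
                                  padding=dilate).squeeze()    # include neighbors
    return token_mask.flatten().nonzero(as_tuple=True)[0]      # 1D token indices
```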
Poster
Yuhan Li · Xianfeng Tan · Wenxiang Shang · Yubo Wu · Jian Wang · Xuanhong Chen · Yi Zhang · Zhu Hangcheng · Bingbing Ni

[ Exhibit Hall I ]

Abstract
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and the absence of clothing semantics in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
Poster
Xi Yu · Xiang Gu · Zhihao Shi · Jian Sun

[ Exhibit Hall I ]

Abstract
Large-scale text-to-image diffusion models have achieved remarkable success in image generation, thereby driving the development of stylized image generation technologies. Recent studies introduce style information by empirically replacing specific features in attention block with style features. However, the relationship between features and style remains unclear. In this paper, we systematically analyze the relationship between features in attention blocks and style. By quantifying the distribution discrepancy induced by style variations using the Wasserstein distance, we find that features in self-attention blocks exhibit high sensitivity to style compared to features in cross-attention blocks. Our analysis provides valuable insights into the contribution of different features to style. Based on our findings, we propose a novel Wasserstein Style Distribution Transform (WSDT) method, which generates stylized images by transforming the distribution of style-sensitive features to align with that of style features. WSDT applies channel adaptive distribution transform to ensure that information not related to the style is not introduced. Our approach is simple yet efficient, optimization-free, and can be seamlessly integrated into attention-based text-to-image diffusion models. Extensive experiments demonstrate the effectiveness of our approach in stylized image generation tasks.
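A minimal, hypothetical PyTorch sketch of the two quantities discussed above under a per-channel Gaussian assumption: a 2-Wasserstein distance between feature distributions (used to quantify style sensitivity) and a channel-wise distribution transform that aligns style-sensitive features with style features. Function names are assumptions.

```python
import torch


def channel_w2_distance(feat_a, feat_b):
    """Per-channel 2-Wasserstein distance under a Gaussian assumption:
    W2^2 = (mu_a - mu_b)^2 + (sigma_a - sigma_b)^2.  feat_*: (B, C, N)."""
    mu_a, mu_b = feat_a.mean(-1), feat_b.mean(-1)
    sd_a, sd_b = feat_a.std(-1), feat_b.std(-1)
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2            # (B, C)


def distribution_transform(content_feat, style_feat, eps=1e-5):
    """Hypothetical channel-adaptive distribution transform: shift each channel
    of the style-sensitive content features so its mean/std match the style
    features, transferring style statistics while keeping the content layout."""
    mu_c = content_feat.mean(-1, keepdim=True)
    sd_c = content_feat.std(-1, keepdim=True) + eps
    mu_s = style_feat.mean(-1, keepdim=True)
    sd_s = style_feat.std(-1, keepdim=True)
    return (content_feat - mu_c) / sd_c * sd_s + mu_s
```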
Poster
Liya Ji · Chenyang Qi · Qifeng Chen

[ Exhibit Hall I ]

Abstract
Editing images via instruction provides a natural way to generate interactive content, but it is challenging due to the higher requirements on scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only single-modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides instruction-based image editing models with the intelligence needed for more complex cases. To achieve this goal, we separate the instruction editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing, handled individually. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, for editing image generation, a hint-guided instruction-based editing network is proposed on top of a sizeable text-to-image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images. Source codes will be publicly available.
Poster
Lisa Dunlap · Trevor Darrell · Joseph Gonzalez · Fabian Caba Heilbron · Josef Sivic · Bryan Russell

[ Exhibit Hall I ]

Abstract
In this paper, we investigate when and how visual representations learned by two different generative models **diverge** from each other. Specifically, given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate our method's ability to find diverging representations, we create an automated data generation pipeline to produce ID$^2$, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we apply CompCon to compare two popular text-to-image models, PixArt and SD-Lightning. We find diverging representations, such as prompts mentioning loneliness resulting in depictions of "wet streets" in PixArt, as well as biases, such as PixArt generating older men for prompts mentioning traditional professions.
Poster
Aryan Yazdan Parast · Basim Azam · Naveed Akhtar

[ Exhibit Hall I ]

Abstract
Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. First, we compute the best-describing token for the visual features pertaining to the causal components of samples via a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model’s reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves …
Poster
Yuanshen Guan · Ruikang Xu · Yinuo Liao · Mingde Yao · Lizhi Wang · Zhiwei Xiong

[ Exhibit Hall I ]

Abstract
While diffusion models have demonstrated significant success in standard dynamic range (SDR) image synthesis, generating high dynamic range (HDR) images with higher luminance and broader color gamuts remains challenging. This arises primarily from two factors: (1) The incompatibility between pretrained SDR image auto-encoders and the high-bit-depth HDR images; (2) The lack of large-scale HDR image datasets for effective learning and supervision. In this paper, we propose a novel framework for HDR image generation with two key innovations: (1) Decomposed HDR Image Generation: We leverage a double-layer HDR image format to decompose the HDR image into two low-bit-depth components: an SDR image with a corresponding Gain Map (GM). This format is inherently compatible with pretrained SDR auto-encoders, motivating the decomposition of HDR image generation into SDR image and GM prediction. (2) Unsupervised Data Construction: We develop an automated pipeline to construct "Text-SDR-GM" triplets from large-scale text-image datasets by brightness-aware compression and gamut-constrained reduction, enabling unsupervised learning of GMs without ground-truth data. Building upon these innovations, we adapt the Stable Diffusion model to jointly predict GMs and SDR images, enabling high-quality decomposed HDR image generation. Experiments show that our framework excels in HDR image generation and SDR-to-HDRTV up-conversion, generalizing well across diverse scenes …
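To make the double-layer decomposition concrete, the following sketch derives and inverts a gain map from an SDR/HDR pair using a per-pixel log2 ratio normalized to [0, 1]; this is one common gain-map encoding and may differ from the paper's exact construction, and the array names are illustrative.

```python
# Minimal sketch (an assumption, not the paper's pipeline): decompose HDR into an SDR base
# plus a low-bit-depth Gain Map (GM), and reconstruct HDR from the pair.
import numpy as np

def to_gain_map(hdr: np.ndarray, sdr: np.ndarray, eps: float = 1e-6):
    """hdr, sdr: float arrays in linear light, same shape (H, W, 3)."""
    ratio = np.log2((hdr + eps) / (sdr + eps))          # per-pixel log exposure ratio
    gm_min, gm_max = ratio.min(), ratio.max()
    gm = (ratio - gm_min) / max(gm_max - gm_min, eps)   # normalize to [0, 1] for 8-bit storage
    return gm, (gm_min, gm_max)

def from_gain_map(sdr: np.ndarray, gm: np.ndarray, gm_range, eps: float = 1e-6):
    """Invert the decomposition: HDR ≈ SDR * 2^ratio."""
    gm_min, gm_max = gm_range
    ratio = gm * (gm_max - gm_min) + gm_min
    return (sdr + eps) * np.exp2(ratio) - eps
```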
Poster
Jongseo Lee · Kyungho Bae · Kyle Min · Gyeong-Moon Park · Jinwoo Choi

[ Exhibit Hall I ]

Abstract
In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). We are inspired by the human memory system, which integrates episodic and semantic memory for accurate information retrieval. ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic and semantic memory through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
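A minimal sketch of what such a cross-attention memory-retrieval module could look like is given below, assuming queries come from the temporally sparse episodic features and keys/values come from learnable semantic prompts; the dimensions, prompt count, and residual integration are assumptions, not the authors' implementation.

```python
# Hedged sketch of a cross-attention memory-retrieval module.
import torch
import torch.nn as nn

class MemoryRetrieval(nn.Module):
    def __init__(self, dim: int = 768, num_prompts: int = 16, num_heads: int = 8):
        super().__init__()
        # semantic memory: learnable prompts holding general knowledge
        self.semantic_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, episodic_feats: torch.Tensor) -> torch.Tensor:
        """episodic_feats: (B, T_sparse, dim) temporally sparse features from episodic memory."""
        b = episodic_feats.size(0)
        prompts = self.semantic_prompts.unsqueeze(0).expand(b, -1, -1)
        # sparse features query the semantic prompts to recover denser representations
        retrieved, _ = self.cross_attn(query=episodic_feats, key=prompts, value=prompts)
        return self.norm(episodic_feats + retrieved)   # residual integration
```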
Poster
Yufan Liu · Wanqian Zhang · Huashan Chen · Lin Wang · Xiaojun Jia · Zheng Lin · Weiping Wang

[ Exhibit Hall I ]

Abstract
Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline that alternates between adversarial suffix optimization and fine-tuning the LLM with the optimized suffixes. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both the perplexity-based filter and the blacklist word filter: (1) we constrain the LLM to generate human-readable prompts through auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we introduce banned-token penalties to suppress the explicit generation of banned tokens in the blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability, which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
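A rough sketch of how such a dual-evasion objective might be scored is shown below, assuming a candidate suffix is rated by an attack reward minus a perplexity term (from an auxiliary LLM's token log-probabilities) and a banned-token penalty; the weights, function name, and inputs are hypothetical.

```python
# Hedged sketch of a dual-evasion score for candidate adversarial suffixes.
import math

def dual_evasion_score(attack_reward: float,
                       token_logprobs: list[float],
                       tokens: list[str],
                       blacklist: set[str],
                       ppl_weight: float = 0.1,
                       ban_penalty: float = 5.0) -> float:
    """Higher is better: reward success, penalize unreadable or blacklisted suffixes."""
    # perplexity from the auxiliary LLM's per-token log-probabilities
    perplexity = math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))
    banned_hits = sum(tok.lower() in blacklist for tok in tokens)
    return attack_reward - ppl_weight * perplexity - ban_penalty * banned_hits
```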
Poster
Jeonghoon Park · Juyoung Lee · Chaeyeon Chung · Jaeseong Lee · Jaegul Choo · Jindong Gu

[ Exhibit Hall I ]

Abstract
Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text descriptions. However, they often exhibit societal biases related to gender, race, and socioeconomic status, thereby reinforcing harmful stereotypes and shaping public perception in unintended ways. While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., African, Asian, and Indian) while preserving non-target attributes (e.g., background details) during bias mitigation. At inference time, EFA randomly samples a target attribute with equal probability and adjusts the cross-attention in selected layers to incorporate the sampled attribute, achieving a fair distribution of target attributes. Extensive experiments demonstrate that EFA outperforms existing methods in mitigating bias while preserving non-target attributes, thereby maintaining the output distribution and generation capability of the original model.
Poster
Yufei Wang · Lanqing Guo · Zhihao Li · Jiaxing Huang · Pichao WANG · Bihan Wen · Jian Wang

[ Exhibit Hall I ]

Abstract
Text-guided image editing is an essential task, enabling users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token re-assembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as little as 1.2 seconds. Extensive …
Poster
Jiahui Yang · Yongjia Ma · Donglin Di · Hao Li · Chen Wei · Xie Yan · Jianxun Cui · Xun Yang · Wangmeng Zuo

[ Exhibit Hall I ]

Abstract
Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes, but suffer from cross-attribute interference when combining multiple LoRA models. This interference stems from unstructured modifications of weight matrices, particularly evident in content-style fusion tasks where merging adaptations leads to undesired feature entanglement. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Extensive experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
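The following sketch illustrates the described parameterization under stated assumptions: the base weight is factored once as W = QR, both factors are frozen, and only a task-specific $\Delta R$ is trained; the wrapper class and buffer handling are our own illustration rather than the released code.

```python
# Hedged sketch of a QR-LoRA-style linear layer: freeze Q and R, train only Delta-R.
import torch
import torch.nn as nn

class QRLoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear):
        super().__init__()
        W = base_linear.weight.data                 # (out_features, in_features)
        Q, R = torch.linalg.qr(W)                   # W = Q @ R, Q has orthonormal columns
        self.register_buffer("Q", Q)                # frozen
        self.register_buffer("R", R)                # frozen
        self.delta_R = nn.Parameter(torch.zeros_like(R))   # the only trainable tensor
        bias = base_linear.bias.detach().clone() if base_linear.bias is not None else None
        self.register_buffer("bias", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W_eff = self.Q @ (self.R + self.delta_R)
        return nn.functional.linear(x, W_eff, self.bias)

# Merging two adaptations (e.g., content + style) then amounts to adding their Delta-R terms:
# merged_weight = Q @ (R + delta_R_content + delta_R_style)
```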
Poster
jing Yang · Qunliang Xing · Mai Xu · Minglang Qiao

[ Exhibit Hall I ]

Abstract
Joint Photographic Experts Group (JPEG) achieves data compression by quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably introduces compression artifacts. Most existing JPEG quality enhancement methods operate in the pixel domain, suffering from the high computational costs of decoding. Consequently, direct enhancement of JPEG images in the DCT domain has gained increasing attention. However, current DCT-domain methods often exhibit limited performance. To address this challenge, we identify two critical types of correlations within the DCT coefficients of JPEG images. Building on this insight, we propose an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method that fully exploits these correlations. The AJQE method enables the adaptation of numerous well-established pixel-domain models to the DCT domain, achieving superior performance with reduced computational complexity. Compared to the pixel-domain counterparts, the DCT-domain models derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5% increase in enhancement throughput on average. The code will be made publicly available.
Poster
Junxiang Qiu · Lin Liu · Shuo Wang · Jinda Lu · Kezhou Chen · Yanbin Hao

[ Exhibit Hall I ]

Abstract
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50\% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC's superior trade-off between efficiency and quality. With 50\% cached blocks, GOC achieves IS 216.28 (26.3\%↑) and FID 3.907 (43\%↓) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements. The code is available at Supplementary …
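One possible reading of cached gradient propagation is sketched below: a short queue stores differences between recomputed and cached block outputs, and a weighted average of those differences corrects the cached feature at skipped steps. The queue length and weighting are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch of a gradient-corrected feature cache for skipped DiT blocks.
from collections import deque
import torch

class GradientOptimizedCache:
    def __init__(self, queue_len: int = 3, weight: float = 0.5):
        self.cached = None
        self.grad_queue = deque(maxlen=queue_len)   # recent feature "gradients" across steps
        self.weight = weight

    def update(self, recomputed: torch.Tensor) -> torch.Tensor:
        """Call at steps where the block is actually recomputed."""
        if self.cached is not None:
            self.grad_queue.append(recomputed - self.cached)
        self.cached = recomputed
        return recomputed

    def reuse(self) -> torch.Tensor:
        """Call at steps where the block is skipped: cached feature + weighted correction."""
        if not self.grad_queue:
            return self.cached
        correction = sum(self.grad_queue) / len(self.grad_queue)
        return self.cached + self.weight * correction
```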
Poster
Ruoyu Wang · Huayang Huang · Ye Zhu · Olga Russakovsky · Yu Wu

[ Exhibit Hall I ]

Abstract
In this work, we introduce **NoiseQuery** as a novel method for enhanced noise initialization in versatile goal-driven text-to-image (T2I) generation. Specifically, we propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts, for better generation quality and controllability. Unlike existing noise optimization methods designed for specific models, our approach is grounded in a fundamental examination of the generic finite-step noise scheduler design in diffusion formulation, allowing better generalization across different diffusion-based architectures in a **tuning-free manner**. This model-agnostic nature allows us to construct a reusable noise library compatible with multiple T2I models and enhancement techniques, serving as a foundational layer for more effective generation. Extensive experiments demonstrate that **NoiseQuery** enables fine-grained control and yields significant performance boosts not only over high-level semantics but also over **low-level visual attributes**, which are typically difficult to specify through text alone, with seamless integration into current workflows with minimal computational overhead.
Poster
Aniruddha Mahapatra · Long Mai · David Bourgin · Yitian Zhang · Feng Liu

[ Exhibit Hall I ]

Abstract
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4× without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.
Poster
Shengqi Liu · Yuhao Cheng · Zhuo Chen · Xingyu Ren · Wenhan Zhu · Lincheng Li · Mengxiao Bi · Xiaokang Yang · Yichao Yan

[ Exhibit Hall I ]

Abstract
Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose **SewingLDM**, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and adaptability to various body shapes.
Poster
Shijie Huang · Yiren Song · Yuxuan Zhang · Hailong Guo · Xueyin Wang · Jiaming Liu

[ Exhibit Hall I ]

Abstract
We introduce ArtEditor, a novel framework for instruction-based image editing that learns unique editing styles from few-shot examples. While image editing has seen significant advancements, customized instructional editing remains underexplored. Existing methods often rely on complex, multi-stage pipelines that are difficult to adapt to specific styles. Additionally, this domain lacks a standardized benchmark, making it challenging to evaluate progress. To address these issues, we propose ArtEditor, a two-stage training framework. In the first stage, we train ArtEditor-Base, a general-purpose image editing model, on large-scale datasets to build a strong foundational capability. In the second stage, we fine-tune this model using ArtEditor-LoRA, a lightweight adaptation module, on a small dataset of before-and-after image pairs. This approach enables the model to efficiently learn distinct editing styles and techniques with minimal data. To enhance the performance of a pre-trained Diffusion Transformer (DiT) model, we introduce two key innovations: position encoding cloning and a noise-free conditioning paradigm. These techniques ensure stable and coherent edits, even when adapting to new styles. To support research in this area, we contribute the DoodleArt dataset, the first benchmark specifically designed for customized image editing. DoodleArt features six high-quality artistic styles created by professional artists and designers, providing a …
Poster
Yikang Zhou · Tao Zhang · Shilin Xu · Shihao Chen · Qianyu Zhou · Yunhai Tong · Shunping Ji · Jiangning Zhang · Lu Qi · Xiangtai Li

[ Exhibit Hall I ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, despite the fact that finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in current strong MLLMs such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. To our knowledge, this is the first such dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively. These results …
Poster
Yunqiu Xu · Linchao Zhu · Yi Yang

[ Exhibit Hall I ]

Abstract
While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. In order to facilitate this research, we construct a new dataset MC-Bench that features 2K high-quality and manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and follow three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with our developed simple yet effective agentic baseline and a finetuned baseline by multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further explore and enhance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. …
Poster
zikai zhou · Shitong Shao · Lichen Bai · Shufei Zhang · zhiqiang xu · Bo Han · Zeke Xie

[ Exhibit Hall I ]

Abstract
The text-to-image diffusion model is a popular paradigm that synthesizes personalized images from a text prompt and a random Gaussian noise. While people observe that some noises are "golden noises" that achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we make three main contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following this concept, we formulate the noise prompt learning framework that systematically learns "prompted" golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we train a small noise prompt network (NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of …
Poster
JUNHAO WEI · YU ZHE · Jun Sakuma

[ Exhibit Hall I ]

Abstract
Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose the first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if it is merged with any other model, while its functionality is kept unchanged if not merged. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the protected model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning method to improve the robustness of our proposal. Our code is available in the appendix.
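The rearrangement idea can be illustrated with a function-preserving permutation of an MLP's hidden units, sketched below: permuting the first layer's rows and the second layer's columns with the same permutation leaves the network's outputs unchanged for any elementwise activation, while moving the parameters away from the shared basin that merging relies on. This is our illustration of the principle, not the authors' code.

```python
# Hedged sketch of a function-preserving MLP parameter rearrangement.
import torch
import torch.nn as nn

@torch.no_grad()
def permute_mlp(fc1: nn.Linear, fc2: nn.Linear, seed: int = 0) -> torch.Tensor:
    """fc1: (hidden, in), fc2: (out, hidden). The composition fc2(act(fc1(x))) is preserved
    for any elementwise activation `act`, because the permutation commutes with it."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(fc1.out_features, generator=g)
    fc1.weight.data = fc1.weight.data[perm]          # permute hidden rows
    if fc1.bias is not None:
        fc1.bias.data = fc1.bias.data[perm]
    fc2.weight.data = fc2.weight.data[:, perm]       # matching column permutation undoes it
    return perm
```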
Poster
Guanjie Chen · Xinyu Zhao · Yucheng Zhou · Xiaoye Qu · Tianlong Chen · Yu Cheng

[ Exhibit Hall I ]

Abstract
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. Through systematic analysis, we identify the absence of long-range feature preservation mechanisms as the root cause of unstable feature propagation and perturbation sensitivity. To this end, we propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets. Theoretical spectral norm and visualization analysis demonstrate how LSCs stabilize feature dynamics. The Skip-DiT architecture and its stabilized dynamic features enable an efficient static caching mechanism that reuses deep features across timesteps while updating shallow components. Extensive experiments across image and video generation tasks demonstrate that Skip-DiT achieves: (1) **4.4$\times$** training acceleration and faster convergence, (2) **1.5-2$\times$** inference acceleration without quality loss and high fidelity to original output, outperforming existing DiT caching methods across various quantitative metrics. Our findings establish long-skip connections as critical architectural components for training stable and efficient diffusion transformers. Codes are provided in the anonymous URL https://anonymous.4open.science/r/Skip-DiT-72B7/.
Poster
Yihong Luo · Tianyang Hu · Jiacheng Sun · Yujun Cai · Jing Tang

[ Exhibit Hall I ]

Abstract
Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student’s trajectory with the teacher’s at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with …
Poster
LiWei Wang · YanDuo Zhang · Tao Lu · Fang Liu · Huiqin Zhang · Jiayi Ma · Huabing Zhou

[ Exhibit Hall I ]

Abstract
Dynamic Scene Graph Generation (DSGG) aims to comprehensively understand videos by abstracting them into visual triplets <*subject*, *predicate*, *object*>. Most existing methods focus on capturing temporal dependencies, but overlook crucial visual relationship dependencies between entities and predicates, as well as among predicate subclasses. These dependencies are essential for a deeper contextual understanding of scenarios. Additionally, current approaches do not support end-to-end training and instead rely on a two-stage pipeline, which incurs higher computational costs. To address these issues, we propose an end-to-end **A**ssociation **R**easoning **N**etwork (ARN) for DSGG. ARN leverages CLIP’s semantic priors to model fine-grained triplet cues to generate scene graphs. In addition, we design a Predicate Association Parsing (PAP) module that employs a conditional weight mapping mechanism to structure entity and predicate representations. We further introduce a Hierarchical Attention (HA) mechanism to integrate spatio-temporal context with entity and predicate representations, enabling effective associative reasoning. Extensive experiments on the Action Genome dataset demonstrate significant performance improvements over existing methods.
Poster
Size Wu · Wenwei Zhang · Lumin Xu · Sheng Jin · Zhonghua Wu · Qingyi Tao · Wentao Liu · Wei Li · Chen Change Loy

[ Exhibit Hall I ]

Abstract
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval (instruction alignment) and MJHQ30K (visual quality) benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be released.
Poster
Zhiyuan Fang · Rengan Xie · Xuancheng Jin · Qi Ye · Wei Chen · Wenting Zheng · Rui Wang · Yuchi Huo

[ Exhibit Hall I ]

Abstract
Recently, the field of 3D scene stylization has attracted considerable attention, particularly for applications in the metaverse. A key challenge is rapidly transferring the style of an arbitrary reference image to a 3D scene while faithfully preserving its content structure and spatial layout. Works leveraging implicit representations with gradient-based optimization achieve impressive style transfer results, yet the lengthy processing time per individual style makes rapid switching impractical. In this paper, we propose A$^3$GS, a novel feed-forward neural network for zero-shot 3DGS stylization that enables transferring any image style to arbitrary 3D scenes in just 10 seconds without the need for per-style optimization. Our work introduces a Graph Convolutional Network (GCN)-based autoencoder aimed at efficient feature aggregation and decoding of spatially structured 3D Gaussian scenes. The encoder converts 3DGS scenes into a latent space. Furthermore, for the latent space, we utilize Adaptive Instance Normalization (AdaIN) to inject features from the target style image into the 3D Gaussian scene. Finally, we constructed a 3DGS dataset using a generative model and proposed a two-stage training strategy for A$^3$GS. Owing to the feed-forward design, our framework can perform fast style transfer on large-scale 3DGS scenes, which poses a severe challenge to the memory consumption …
Poster
Ata Çelen · Iro Armeni · Daniel Barath · Marc Pollefeys

[ Exhibit Hall I ]

Abstract
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
Poster
Mingqi Fang · Ziguang Li · Lingyun Yu · Quanwei Yang · Hongtao Xie · Yongdong Zhang

[ Exhibit Hall I ]

Abstract
Recently, synthetic images have become incredibly realistic with the development of generative techniques. To avoid the spread of misinformation and identify synthetic content, research on synthetic image detection has become urgent. Unfortunately, limited by a singular forensic perspective, existing methods struggle to explore sufficient traces when confronted with diverse synthetic techniques. In response to this, we argue that different synthetic images encompass a variety of forensic traces, and utilizing multiple experts to explore traces from diverse perspectives will be beneficial. Accordingly, a novel detector with the **M**ixture **o**f multiple forensic **E**xperts is proposed, named **Forensic-MoE**. To integrate multiple experts and enhance the knowledge interaction, Forensic-MoE follows an adapter-backbone architecture. Specifically, multiple adapters trained on different synthetic images serve as the trace exploration experts, and they are uniformly integrated into a pretrained backbone model to learn the detection prior and encourage expert interaction. By guiding multiple experts to align with each other and collaborate together, Forensic-MoE can integrate comprehensive and discriminative detection traces from multiple perspectives. Moreover, to improve the discrimination of each expert, a multi-stage structure is proposed for efficient trace perception, and a patch decentralization strategy is applied to encourage the model's attention on every local region. Extensive experiments demonstrate the …
Poster
Tomoyuki Suzuki · Kang-Jun Liu · Naoto Inoue · Kota Yamaguchi

[ Exhibit Hall I ]

Abstract
Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for a re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers and completing the background. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and ground-truth layer structure may not be reliable, we develop a metric that measures the quality of the decomposition. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.
Poster
Jonas Belouadi · Eddy Ilg · Margret Keuper · Hideki Tanaka · Masao Utiyama · Raj Dabre · Steffen Eger · Simone Paolo Ponzetto

[ Exhibit Hall I ]

Abstract
With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like Ti*k*Z, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting Ti*k*Zero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, Ti*k*Zero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models will be made publicly available.
Poster
Marc Lafon · Yannis Karmim · Julio Silva-Rodríguez · Paul Couairon · Clément Rambour · Raphael Fournier-Sniehotta · Ismail Ayed · Jose Dolz · Nicolas THOME

[ Exhibit Hall I ]

Abstract
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification.
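A minimal sketch of such an uncertainty head is given below, assuming the visual embedding, the predicted textual embedding, and an image-conditioned textual representation (built by cross-attending over all class text embeddings) are concatenated and fed to a small MLP trained as a binary error classifier with weighted BCE; the dimensions and module names are assumptions, not the authors' architecture.

```python
# Hedged sketch of a ViLU-style uncertainty predictor over vision-language embeddings.
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, img_emb, pred_txt_emb, class_txt_embs):
        """img_emb: (B, D); pred_txt_emb: (B, D); class_txt_embs: (B, K, D)."""
        # image-conditioned textual representation via cross-attention over class texts
        ctx, _ = self.cross_attn(query=img_emb.unsqueeze(1),
                                 key=class_txt_embs, value=class_txt_embs)
        feats = torch.cat([img_emb, pred_txt_emb, ctx.squeeze(1)], dim=-1)
        return self.mlp(feats).squeeze(-1)            # logit that the prediction is wrong

# Weighted BCE, up-weighting errors since they are the rare class (pos_weight is illustrative):
# loss = nn.functional.binary_cross_entropy_with_logits(
#     logits, is_error.float(), pos_weight=torch.tensor(5.0))
```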
Poster
Grzegorz Gruszczynski · Jakub Meixner · Michał Włodarczyk · Przemyslaw Musialski

[ Exhibit Hall I ]

Abstract
We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Péclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic "turbulence," we generate stochastic velocity fields that introduce coherent motion and capture multi-scale mixing. A diffusion model then learns to invert the advection-diffusion operator, reconstructing fine details from coarsely transported images and thus constituting a novel generative diffusion model. We discuss how previous methods emerge as specific cases (zero velocity or zero blur) of our operator, demonstrating that our advection-diffusion framework generalizes prior PDE-based diffusion techniques. This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.
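For intuition, one explicit finite-difference step of such an advection-diffusion corruption is sketched below; it is a simplified stand-in for the paper's GPU Lattice Boltzmann solver, with an assumed velocity field, periodic boundaries, and illustrative step sizes and noise scale.

```python
# Hedged sketch: one explicit Euler step of advection-diffusion corruption on an image channel.
import numpy as np

def advect_diffuse_step(u, vx, vy, D=0.1, dt=0.1, noise_std=0.01, rng=None):
    """u: (H, W) image channel; vx, vy: (H, W) velocity field components."""
    rng = rng or np.random.default_rng(0)
    # central differences for the gradient, 5-point stencil for the Laplacian (periodic BCs)
    ux = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / 2.0
    uy = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / 2.0
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
    advection = vx * ux + vy * uy
    # du/dt = -v . grad(u) + D * laplacian(u), plus Gaussian noise
    return u + dt * (-advection + D * lap) + noise_std * rng.standard_normal(u.shape)
```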
Poster
Ziyi Liu · Zhe Xu · Jiabo MA · Wenqiang Li · Ruixuan Wang · Bo Du · Hao Chen

[ Exhibit Hall I ]

Abstract
Pathological images have been recognized as the gold standard for cancer diagnosis for more than a century. However, some internal regions of pathological images may inevitably exhibit various degradation issues, including low resolution, image blurring, and image noise, which affect disease diagnosis, staging, and risk stratification. Existing pathological image restoration methods are mainly based on generative adversarial networks (GANs) to improve image quality, which are limited by inherent instability and loss of structural details, often resulting in artifacts in the restored images. The large scale of whole slide images (WSIs) also makes efficient processing and restoration difficult. To address these limitations, we propose a conditional visual autoregressive model (CVARPath) for next-scale token prediction, guided by the degraded tokens from the current scale. We introduce a novel framework that employs quantized encoders specifically designed for pathological image generation, which learn consistent sparse vocabulary tokens through self-supervised contrastive learning. Furthermore, our method efficiently compresses image patches into compact degraded sparse tokens at smaller scales and reconstructs high-quality large-scale whole slide images (WSIs). This is achieved using only an 8×8 vocabulary index for 256×256 images while maintaining minimal reconstruction loss. Experimental results demonstrate that our approach significantly enhances image quality, achieving an …
Poster
Haoyang Chen · Dongfang Sun · Caoyuan Ma · Shiqin Wang · Kewei Zhang · Zheng Wang · Zhixiang Wang

[ Exhibit Hall I ]

Abstract
We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes the dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) the huge modality gap between planar sketches and 3D priors in diffusion, and (3) sketch-quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectations in a training-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
Poster
Wenxue Li · Tian Ye · Xinyu Xiong · Jinbin Bai · feilong tang · Wenxuan Song · Zhaohu Xing · Lie Ju · Guanbin Li · Lei Zhu

[ Exhibit Hall I ]

Abstract
Glass Surface Detection (GSD) is a critical task in computer vision, enabling precise interactions with transparent surfaces and enhancing both safety and object recognition accuracy. However, current research still faces challenges in both recognition performance and generalization capability. Thanks to recent advanced diffusion-based generative models, the GSD task can benefit from the rich prior knowledge encapsulated in the pre-trained Stable Diffusion (SD) model. Thus, in this paper, we present GlassWizard, aiming to harvest priors in diffusion-based models to achieve accurate and generalized GSD. Firstly, we delve into the text embedding space in SD to build a text-based context prior, thereby enhancing the understanding of the implicit attributes of glass and achieving fine-grained predictions. Secondly, we train an end-to-end diffusion model with a one-step formulation pipeline, yielding effective optimization and fast inference. In addition, to make our adapted framework scalable to other multi-modal GSD tasks (such as RGB-D/RGB-T GSD), we present a modality-customized adaptation that can achieve rapid adaptation to multi-modal GSD tasks. Our experimental results demonstrate that our proposed framework achieves cutting-edge performance across diverse datasets, and it also shows strong generalization ability. Additionally, it excels in multi-modal GSD tasks, confirming its scalability across different modalities. The code will be publicly released.
Poster
Chen Yi Lu · Mehrab Tanjim · Ishita Dasgupta · Somdeb Sarkhel · Gang Wu · Saayan Mitra · Somali Chaterji

[ Exhibit Hall I ]

Abstract
We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent and incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over the state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared to 22% preferring text-based assembly methods.
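A minimal sketch of beam search guided by a coherence score is shown below; `lca_score` stands in for the learned LCA model, and the beam width and target length are illustrative, so this is a schematic of the search rather than the released implementation.

```python
# Hedged sketch of LCA-guided beam search over candidate shots.
def assemble(shots, lca_score, target_len=5, beam_width=4):
    """shots: list of shot ids; lca_score(sequence) -> float (higher = more coherent)."""
    beams = [([], 0.0)]                      # (partial sequence, score)
    for _ in range(target_len):
        candidates = []
        for seq, _ in beams:
            for s in shots:
                if s in seq:                 # each shot used at most once
                    continue
                new_seq = seq + [s]
                candidates.append((new_seq, lca_score(new_seq)))
        if not candidates:
            break
        # keep only the best `beam_width` partial assemblies
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]
```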
Poster
Mainak Singha · Subhankar Roy · Sarthak Mehrotra · Ankit Jha · Moloud Abdar · Biplab Banerjee · Elisa Ricci

[ Exhibit Hall I ]

Abstract
Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. After training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on comprehensive contextual information -- image-conditioned features and textual attribute features of a class -- that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contextual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods.
Poster
Mert Sonmezer · Matthew Zheng · Pinar Yanardag

[ Exhibit Hall I ]

Abstract
Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
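As one plausible instantiation of the submodular selection, the sketch below runs greedy maximization of a facility-location objective over LoRA descriptor embeddings; the paper's exact objective and features may differ, and the embeddings here are hypothetical.

```python
# Hedged sketch: greedy submodular (facility-location) selection of diverse, relevant LoRAs.
import numpy as np

def greedy_facility_location(embs: np.ndarray, k: int) -> list[int]:
    """embs: (N, D) L2-normalized LoRA descriptors; returns indices of k selected models."""
    sim = embs @ embs.T                     # pairwise cosine similarities
    n = len(embs)
    covered = np.zeros(n)                   # best similarity to any selected model so far
    selected = []
    for _ in range(k):
        # marginal gain: improvement in total coverage if each candidate were added
        gains = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf           # never re-select
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```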
Poster
Yi-Hsin Chen · Yi-Chen Yao · Kuan-Wei Ho · Chun-Hung Wu · Huu-Tai Phung · Martin Benjak · Jörn Ostermann · Wen-Hsiao Peng

[ Exhibit Hall I ]

Abstract
Most frame-based learned video codecs can be interpreted as recurrent neural networks (RNNs) propagating reference information along the temporal dimension. This work revisits the limitations of the current approaches from an RNN perspective. The output-recurrence methods, which propagate decoded frames, are intuitive but impose dual constraints on the output decoded frames, leading to suboptimal rate-distortion performance. In contrast, the hidden-to-hidden connection approaches, which propagate latent features within the RNN, offer greater flexibility but require large buffer sizes. To address these issues, we propose HyTIP, a learned video coding framework that combines both mechanisms. Our hybrid buffering strategy uses explicit decoded frames and a small number of implicit latent features to achieve competitive coding performance. Experimental results show that our HyTIP outperforms the sole use of either output-recurrence or hidden-to-hidden approaches. Furthermore, it achieves comparable performance to state-of-the-art methods but with a much smaller buffer size, and outperforms VTM 17.0 (Low-delay B) in terms of PSNR-RGB and MS-SSIM-RGB.
Poster
Kazuma Nagata · Naoshi Kaneko

[ Exhibit Hall I ]

Abstract
Automatic colorization methods for line drawings have been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative evaluations demonstrate the advantages of using multiple reference images, achieving superior colorization performance. Our code and model will be released upon acceptance.
Poster
Divyansh Srivastava · Xiang Zhang · He Wen · Chenru Wen · Zhuowen Tu

[ Exhibit Hall I ]

Abstract
We present Lay-Your-Scene (LayouSyn for short), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware Diffusion-Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn: First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
Poster
Jaemin Kim · Bryan Sangwoo Kim · Jong Ye

[ Exhibit Hall I ]

Abstract
Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for video, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage stitching between video frames and use system prompts to capture sequential attributions. Our framework supports the flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free$^2$Guide using image-trained LVLMs significantly improves text-to-video alignment, thereby enhancing the overall video quality. Our results and code are available at https://free2guide.github.io/
Poster
Keming Wu · Junwen Chen · Zhanhao Liang · Yinuo Wang · Ji Li · Chao Zhang · Bin Wang · Yuhui Yuan

[ Exhibit Hall I ]

Abstract
Text-to-image generation models often struggle to interpret spatially aware text prompts effectively. To overcome this, existing approaches typically require millions of high-quality semantic layout annotations consisting of bounding boxes and regional prompts. This paper shows that large amounts of regional prompts are unnecessary for the latest diffusion transformers like SD3 or FLUX. In this paper, we propose an efficient hybrid layout framework for diffusion transformers. Our approach drastically reduces the need for extensive layout annotations and minimizes reliance on regional prompt annotations, incurring only minimal additional computational cost during inference, while maintaining high-quality layout adherence. Our key insight is to break the layout-control task into two sequential stages: first, generating the target objects within the designated regions specified by an anonymous layout, and second, refining these outputs to ensure they strictly adhere to the regional prompts in the semantic layout. Building on this insight, we propose a hybrid layout control scheme that first fine-tunes the DiTs (e.g., SD3) to follow an anonymous layout, then continues fine-tuning the DiTs to follow the semantic layout, and finally includes a quality-tuning stage to enhance visual aesthetics. We show that this hybrid design is highly data-efficient, as we find only using a small amount of semantic layout …
Poster
Tao Han · Wanghan Xu · Junchao Gong · Xiaoyu Yue · Song Guo · Luping Zhou · LEI BAI

[ Exhibit Hall I ]

Abstract
Arbitrary-resolution image generation provides a consistent visual experience across devices and has extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation to take over 100 seconds. To solve this, we explore a second generation stage built on latent diffusion models: the fixed-size latent produced by the diffusion model is regarded as a content representation, and a one-step generator decodes images of arbitrary resolution from this compact latent. We thus present \textbf{InfGen}, which replaces the VAE decoder with the new generator to produce images at any resolution from a fixed-size latent without retraining the diffusion model. This simplifies the process, reduces computational complexity, and can be applied to any model sharing the same latent space. Experiments show that InfGen can bring many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds. The demo page, code, and pre-trained models are available at: \url{https://anonymous.4open.science/r/InfGen-7257}.
Poster
Yazhou Xing · Yang Fei · Yingqing He · Jingye Chen · Pengjun Fang · Xiaowei Chi · Qifeng Chen

[ Exhibit Hall I ]

Abstract
Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation results in temporal inconsistencies and fails to compress temporal redundancy effectively. Existing video VAEs compress temporal redundancy but struggle to handle videos with large motion, suffering from severe image blur and loss of detail in such scenarios. In this paper, we present a powerful video VAE named VideoVAE+ that effectively reconstructs videos with large motion. First, we investigate two architecture choices and propose a simple yet effective architecture with better spatiotemporal joint modeling performance. Second, we propose to leverage the textual information in existing text-to-video datasets and incorporate text guidance during training; the textual guidance is optional during inference. We find that this design enhances reconstruction quality and the preservation of detail. Finally, our model achieves strong performance compared with various baseline approaches on both general videos and large-motion videos, demonstrating its effectiveness in challenging large-motion scenarios.
Poster
Zihan Ding · Chi Jin · Difan Liu · Haitian Zheng · Krishna Kumar Singh · Qiang Zhang · Yan Kang · Zhe Lin · Yuchen Liu

[ Exhibit Hall I ]

Abstract
Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model’s diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model using 50-step DDIM sampling.
Poster
Grace Luo · Jonathan Granskog · Aleksander Holynski · Trevor Darrell

[ Exhibit Hall I ]

Abstract
Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes.
Poster
Inzamamul Alam · Md Islam · Simon Woo · Khan Muhammad

[ Exhibit Hall I ]

Abstract
Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against common transformations, including distortions, image regeneration, and adversarial perturbations, which creates real-world challenges. In this work, we introduce SpecGuard, a novel approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain, using spectral projection of a higher-frequency band obtained by wavelet decomposition. The spectral projection employs a Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval’s theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate SpecGuard on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate that SpecGuard outperforms state-of-the-art models.
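As a rough illustration of frequency-domain watermark embedding of the kind described above, the following sketch uses a wavelet decomposition (via PyWavelets) and an FFT of a high-frequency band with a strength factor. It is a simplified stand-in under stated assumptions, not the SpecGuard implementation.

```python
# Minimal sketch: embed bits into the FFT of a wavelet high-frequency band.
import numpy as np
import pywt

def embed_watermark(gray_image: np.ndarray, message_bits: np.ndarray, strength: float = 2.0):
    # Single-level wavelet decomposition: LL is low-frequency, (LH, HL, HH) are high-frequency bands.
    LL, (LH, HL, HH) = pywt.dwt2(gray_image.astype(np.float64), "haar")
    spectrum = np.fft.fft2(HH)                              # spectral projection of a high-frequency band
    flat = spectrum.flatten()
    pattern = 2.0 * message_bits.astype(np.float64) - 1.0   # map {0,1} -> {-1,+1}
    flat[: pattern.size] += strength * pattern              # strength factor controls robustness/invisibility
    HH_marked = np.real(np.fft.ifft2(flat.reshape(spectrum.shape)))
    return pywt.idwt2((LL, (LH, HL, HH_marked)), "haar")

# Usage (assumed 64-bit message): watermarked = embed_watermark(img, np.random.randint(0, 2, 64))
```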
Poster
SeungHoo Hong · GeonHo Son · Juhun Lee · Simon Woo

[ Exhibit Hall I ]

Abstract
Diffusion models have been shown to be strong representation learners, showcasing state-of-the-art performance across multiple domains. Aside from accelerated sampling, DDIM also enables the inversion of real images back to their latent codes. A direct application of this inversion operation is real image editing, where the inversion yields latent trajectories to be utilized during the synthesis of the edited image. Unfortunately, this practical tool has enabled malicious users to freely synthesize misinformative or deepfake content with greater ease, which promotes the spread of unethical, abusive, privacy-infringing, and copyright-infringing content. While defensive algorithms such as AdvDM and Photoguard have been shown to disrupt the diffusion process on these images, the misalignment between their objectives and the iterative denoising trajectory at test time results in weak disruptive performance. In this work, we present the \textbf{D}DIM \textbf{I}nversion \textbf{A}ttack (DIA), which attacks the integrated DDIM trajectory path. Our results demonstrate effective disruption, surpassing previous defensive methods across various editing methods. We believe that our framework and results can provide practical defense methods against the malicious use of AI for both industry and the research community. Our code is available here: \url{https://anonymous.4open.science/r/DIA-13419/}.
Poster
Viet Nguyen · Anh Nguyen · Trung Dao · Khoi Nguyen · Cuong Pham · Toan Tran · Anh Tran

[ Exhibit Hall I ]

Abstract
The escalating demand for real-time image synthesis has driven significant advancements in one-step diffusion models, which inherently offer expedited generation speeds compared to traditional multi-step methods. However, this enhanced efficiency is frequently accompanied by a compromise in the controllability of image attributes. While negative prompting, typically implemented via classifier-free guidance (CFG), has proven effective for fine-grained control in multi-step models, its application to one-step generators remains largely unaddressed. Due to the lack of iterative refinement, as in multi-step diffusion, directly applying CFG to one-step generation leads to blending artifacts and diminished output quality. To fill this gap, we introduce Negative-Away Steer Attention (NASA), a training-free method that integrates negative prompts into one-step diffusion models. NASA operates within the intermediate representation space by leveraging cross-attention mechanisms to suppress undesired visual attributes. This strategy avoids the blending artifacts inherent in output-space guidance and achieves high efficiency, incurring only a minimal 1.89\% increase in FLOPs compared to the computational doubling of CFG. Furthermore, NASA can be seamlessly integrated into existing timestep distillation frameworks, enhancing the student's output quality. Experimental results demonstrate that NASA substantially improves controllability and output quality, achieving an HPSv2 score of 31.21, setting a new state-of-the-art benchmark for one-step diffusion …
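A minimal sketch of the general "steer features away from the negative prompt" idea, assuming `h_pos` and `h_neg` are intermediate features computed with the positive and negative prompts. The actual NASA method operates on cross-attention, so this is only a simplified illustration.

```python
# Hedged sketch: feature-space steering away from negative-prompt features.
import torch

def negative_away_steer(h_pos: torch.Tensor, h_neg: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Push the positive-prompt features further away from the negative-prompt features."""
    direction = h_pos - h_neg                                        # points away from the negative features
    direction = direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)
    return h_pos + alpha * direction                                 # single steering step, no iteration needed
```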
Poster
Tianyang Xue · Lin Lu · Yang Liu · Mingdong Wu · Hao Dong · Yanbin Zhang · Renmin Han · Baoquan Chen

[ Exhibit Hall I ]

Abstract
2D irregular packing is a classic combinatorial optimization problem with various applications, such as material utilization and texture atlas generation. Due to its NP-hard nature, conventional numerical approaches typically encounter slow convergence and high computational costs. Previous research (GFPack) introduced a generative method for gradient-based packing, providing early evidence of its feasibility, but it faced limitations such as insufficient rotation support, poor boundary adaptability, and high overlap ratios. In this paper, we propose GFPack++, a deeply investigated framework that adopts attention-based geometry and relation encoding, enabling more comprehensive modeling of complex packing relationships. We further design a constrained gradient and a weighting function to enhance both the feasibility of the produced solutions and the learning effectiveness. Experimental results on multiple datasets demonstrate that GFPack++ achieves higher space utilization, supports continuous rotation, generalizes well to arbitrary boundaries, and infers orders of magnitude faster than previous approaches. We plan to release our code and datasets to advance further research in 2D irregular packing.
Poster
Ting Yao · Yehao Li · Yingwei Pan · Zhaofan Qiu · Tao Mei

[ Exhibit Hall I ]

Abstract
Autoregressive models are at a tipping point where they could truly take off for visual generation. In this paper, we propose to model token prediction with a diffusion procedure, particularly in masked autoregressive models for image generation. We examine the problem from two critical perspectives: progressively refining the prediction of unmasked tokens via a denoising head with the autoregressive model, and representing the probability distribution of masked tokens by capitalizing on the interdependency across masked and unmasked tokens through a diffusion head. Our proposal retains the speed advantage of sequence prediction while leveraging the principles of the denoising diffusion process to generate high-quality samples. Extensive experiments on both class-conditional and text-to-image tasks demonstrate its superiority, achieving state-of-the-art FID scores of 1.47 and 5.27 on the ImageNet and MSCOCO datasets, respectively. More remarkably, our approach yields a 45\% speedup in image generation inference time over diffusion models such as DiT-XL/2.
Poster
Yecheng Wu · Han Cai · Junyu Chen · Zhuoyang Zhang · Enze Xie · Jincheng YU · Junsong Chen · Jinyi Hu · Yao Lu · Song Han

[ Exhibit Hall I ]

Abstract
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT, a deep compression hybrid tokenizer for AR models that achieves a 32$\times$ spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of $\textbf{5.49}$ on MJHQ-30K and an overall score of $\textbf{0.69}$ on GenEval, while offering $\textbf{1.5-7.9}\times$ higher throughput and $\textbf{2.0-3.5}\times$ lower latency compared to prior leading diffusion and masked autoregressive models. We will release the code and pre-trained models upon publication.
Poster
Léopold Maillard · Tom Durand · Adrien RAMANANA RAHARY · Maks Ovsjanikov

[ Exhibit Hall I ]

Abstract
Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method, and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D awareness while leveraging their rich prior knowledge. Our method supports camera control and conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on- and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning, and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating, or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.
Poster
Prasen Kumar Sharma · Neeraj Matiyali · Siddharth Srivastava · Gaurav Sharma

[ Exhibit Hall I ]

Abstract
We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail to (i) preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object preservation and background guidance modules, enforcement of lighting consistency, and a high-frequency overlay module that retains fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work …
Poster
XUN WU · Shaohan Huang · Lingjie Jiang · Furu Wei

[ Exhibit Hall I ]

Abstract
Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. However, we identify two potential risks for existing DPO algorithms: first, current DPO methods for estimating the rewards of step-wise intermediate samples are biased, leading to inaccurate preference ordering for step-wise optimization; second, existing DPO methods may inadvertently increase the sampling probabilities of dispreferred samples, potentially introducing application risks. To address these issues, we propose Revised Direct Preference Optimization (RDPO), a simple but effective step-wise DPO-based alignment method for text-to-image diffusion models. By designing a more theoretically grounded and efficient intermediate-step reward estimation and introducing an additional regularization term to constrain the sampling probability of dispreferred samples, RDPO achieves more effective and stable text-to-image alignment. Our experiments on two datasets, with base models including Stable Diffusion v1.5 and SDXL, demonstrate that RDPO can effectively learn and construct reward signals for each step of the model, improving alignment performance while ensuring better generalization.
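For intuition, here is a hedged sketch of a DPO-style objective with an extra regularizer that discourages raising the dispreferred sample's likelihood above its reference value. The exact RDPO reward estimation and regularization are not reproduced; the hyperparameters `beta` and `lam` are illustrative.

```python
# Hedged sketch: DPO-style loss plus a penalty on dispreferred-likelihood increases.
import torch
import torch.nn.functional as F

def dpo_with_dispreferred_regularizer(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                      beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """logp_* are per-sample log-probabilities of preferred (w) and dispreferred (l) samples
    under the trained and reference models (all tensors of shape [batch])."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    dpo_loss = -F.logsigmoid(beta * margin)
    # Regularizer: penalize any increase of the dispreferred sample's log-probability.
    reg = F.relu(logp_l - ref_logp_l)
    return (dpo_loss + lam * reg).mean()
```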
Poster
Carl Olsson · Yaroslava Lochman · Johan Malmport · Christopher Zach

[ Exhibit Hall I ]

Abstract
Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.
Poster
Ava Pun · Kangle Deng · Ruixuan Liu · Deva Ramanan · Changliu Liu · Jun-Yan Zhu

[ Exhibit Hall I ]

Abstract
We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during auto-regressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method, enabling us to generate colored and textured designs. We show that our designs can be assembled by humans manually as well as by robotic arms automatically. Upon publication, we will release our new dataset, StableText2Lego, which contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models.
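A simplified sketch of autoregressive decoding with a validity check and rollback, in the spirit of the inference procedure described above. `model.next_token_logits` and `is_physically_valid` are assumed interfaces, not the released API.

```python
# Hedged sketch: next-token decoding that prunes infeasible bricks and rolls back when stuck.
import torch

def generate_with_rollback(model, prompt_tokens, is_physically_valid, max_bricks: int = 200):
    seq = list(prompt_tokens)
    for _ in range(max_bricks):
        logits = model.next_token_logits(torch.tensor(seq))        # assumed interface: 1-D logits
        order = torch.argsort(logits, descending=True).tolist()
        placed = False
        for tok in order:                                          # try candidates, most likely first
            if is_physically_valid(seq + [tok]):                   # stability / assembly-constraint check
                seq.append(tok)
                placed = True
                break
        if not placed:                                             # roll back the previous brick if nothing fits
            if len(seq) > len(prompt_tokens):
                seq.pop()
            else:
                break
    return seq
```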
Poster
Imad Eddine MAROUF · Enzo Tartaglione · Stéphane Lathuilière · Joost van de Weijer

[ Exhibit Hall I ]

Abstract
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across both visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the current task’s answer space, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. The source code, provided in the supplementary material, will be publicly released upon acceptance.
Poster
Tuna Meral · Enis Simsar · Federico Tombari · Pinar Yanardag

[ Exhibit Hall I ]

Abstract
Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for personalization, enabling efficient adaptation of pre-trained image generation models for specific tasks without comprehensive retraining. While employing individual pre-trained LoRA models excels at representing single concepts, such as a specific dog or cat, utilizing multiple LoRA models to capture a variety of concepts in a single image still poses a significant challenge. Existing methods often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept may be completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). We introduce CLoRA, a training-free approach that addresses these limitations by updating the attention maps of multiple LoRA models at test time and leveraging the attention maps to create semantic masks for fusing latent representations. This enables the generation of composite images that accurately reflect the characteristics of each LoRA. Our comprehensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.
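The following sketch illustrates one way attention maps could be turned into per-pixel masks for fusing latents from multiple LoRAs, as a hedged approximation of the fusion step described above; it is not the CLoRA update procedure itself.

```python
# Hedged sketch: fuse per-LoRA latents with softmax masks derived from attention maps.
import torch

def fuse_latents_with_attention_masks(latents, attn_maps, temperature: float = 0.1) -> torch.Tensor:
    """latents: list of (C, H, W) latents, one per LoRA.
    attn_maps: list of (H, W) cross-attention maps, one per LoRA's target concept."""
    maps = torch.stack(attn_maps)                        # (n, H, W)
    masks = torch.softmax(maps / temperature, dim=0)     # per-pixel soft assignment across concepts
    stacked = torch.stack(latents)                       # (n, C, H, W)
    return (masks.unsqueeze(1) * stacked).sum(dim=0)     # (C, H, W) fused latent
```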
Poster
Hui Li

[ Exhibit Hall I ]

Abstract
Generative AI (GenAI), which has revolutionized both computer vision and natural language processing, continues to draw attention. Benefiting from GenAI and the evolution of large language models (LLMs), the image generation task has evolved from prompt-based to dialogue-based, capturing real-world human intent expressed through conversations. When breaking this task into multiple steps, the best pathway for analyzing the dialogues is not predetermined, for example whether the objects or the prompt template should be the focus of the first analysis step. Thus, multi-chain reasoning is required to decompose this application beyond a pure chain-of-thought structure. After this divergent process, the question becomes how to converge on the thinking chain that leads to the best-matched image, which requires a new evaluation method to guide the thinking process. To address these challenges, we propose the LLM Thought Divergence and Convergence (LTDC) framework, which simulates human cognitive processes through three phases: (1) the Step-by-Step Thought process decomposes dialogue-based image generation tasks into sequential thinking chains using LLMs; (2) the Image Generation process creates image prompts following these thought instructions and produces corresponding images; (3) the Evaluation process aligns the coherence between generated images and dialogues through a multi-modal LLM, guiding the …
Poster
Yukang Cao · Chenyang Si · Jinghao Wang · Ziwei Liu

[ Exhibit Hall I ]

Abstract
We present **FreeMorph**, the first tuning-free method for image morphing that accommodates inputs with varying semantics or layouts. Unlike existing methods, which rely on fine-tuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without extensive training. Despite their efficiency and potential, tuning-free methods still face challenges in maintaining high-quality image morphing due to the non-linear nature of the multi-step denoising process and bias inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address this challenge by integrating two key innovations. **1)** We first propose a **guidance-aware spherical interpolation** design that incorporates explicit guidance from the input images by modifying the self-attention modules, addressing identity loss, and ensuring directional transitions throughout the generated sequences. **2)** We further introduce a **step-oriented variation trend** that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both input images. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods while being 10× to 50× faster, establishing a new state-of-the-art for image morphing. The code will be released.
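Spherical linear interpolation of latent codes is the basic building block behind the interpolation design mentioned above; a minimal, self-contained version is sketched below (the guidance-aware modifications to self-attention are not shown).

```python
# Minimal sketch: spherical linear interpolation (slerp) between two latent codes.
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    """Interpolate along the great circle between z0 and z1; alpha in [0, 1]."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_theta = torch.clamp(
        torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < eps:                                  # nearly parallel: fall back to linear interpolation
        return (1 - alpha) * z0 + alpha * z1
    return (torch.sin((1 - alpha) * theta) * z0 + torch.sin(alpha * theta) * z1) / torch.sin(theta)
```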
Poster
Yabo Zhang · xinpeng zhou · Yihan Zeng · Hang Xu · Hui Li · Wangmeng Zuo

[ Exhibit Hall I ]

Abstract
Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change under various physical interactions. However, these models are usually built upon text-to-image diffusion models, so they necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, inheriting powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across a variety of editing signals: it decisively outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, …
Poster
Zhongdao Wang · Guodongfang Zhao · Jingjing Ren · bailan feng · Shifeng Zhang · Wenbo Li

[ Exhibit Hall I ]

Abstract
Diffusion-based generative models have demonstrated exceptional promise in super-resolution (SR) tasks, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. When the input is video, the problem becomes even more pronounced: current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: **(1)** We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. **(2)** Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. **(3)** We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods while being 100+ times faster, taking only 7 seconds to process a 2-second 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes …
Poster
Runze He · bo cheng · Yuhang Ma · QingxiangJia QingxiangJia · Shanyuan Liu · Ao Ma · Xiaoyu Wu · Liebucha Wu · Dawei Leng · Yuhui Yin

[ Exhibit Hall I ]

Abstract
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential.
Poster
Jingjing Ren · Wenbo Li · Zhongdao Wang · Haoze Sun · Bangzhen Liu · Haoyu Chen · Jiaqi Xu · Aoxue Li · Shifeng Zhang · Bin Shao · Yong Guo · Lei Zhu

[ Exhibit Hall I ]

Abstract
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to the quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint and making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and the limited model size constrain generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced …
Poster
Taihang Hu · Linxuan Li · Kai Wang · Yaxing Wang · jian Yang · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (*ISLock*), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, *ISLock* preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (*ATM*) protocol. By implicitly enforcing structural consistency in latent space, our method *ISLock* enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that *ISLock* achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models.
Poster
Xixi Hu · Runlong Liao · Bo Liu · Keyang Xu · Yeqing Li · Eugene Ie · Hongliang Fei · qiang liu

[ Exhibit Hall I ]

Abstract
Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is particularly critical during stochastic sampling at inference, as the score function's errors are amplified near the boundary. To mitigate this, we propose a Boundary-enforced Rectified Flow Model (Boundary RF Model), in which we enforce boundary conditions with a minimal code modification. The Boundary RF Model improves performance over the vanilla RF model, demonstrating an 8.01% improvement in FID score on ImageNet using ODE sampling and an 8.98% improvement using SDE sampling.
Poster
Yunze Tong · Fengda Zhang · Didi Zhu · Jun Xiao · Kun Kuang

[ Exhibit Hall I ]

Abstract
The fundamental requirement for text-to-image generation is aligning the generated images with the provided text. With large-scale data, pre-trained Stable Diffusion (SD) models have achieved remarkable performance in this task. These models process an input prompt as text control, guiding a vision model to perform denoising operations that recover a clean image from pure noise. However, we observe that when there is correlation among text tokens, SD’s generated images fail to accurately represent the semantics of the input prompt: simple yet crucial objects may be omitted, thereby disrupting text-image alignment. We refer to this problem as *"object omission"*. Without additional external knowledge, previous methods have been ineffective at addressing this issue. To investigate this problem, we analyze the attention maps in SD and find that biased text representations mislead the visual denoising process when handling correlated tokens, impeding object generation. Moreover, we observe that even when two prompts share the same semantics, slight variations in token sequence significantly alter attention scores, consequently affecting the final generated images. Based on these findings, we propose a simple yet effective fine-tuning method that applies decorrelation to the self-attention maps in the text module, thus reducing dependencies between tokens. Our approach requires no external …
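As a hedged illustration of a decorrelation objective on attention maps, the sketch below penalizes off-diagonal correlations between per-token attention rows; the paper's exact loss and where it is applied may differ.

```python
# Hedged sketch: decorrelation loss over self-attention maps from the text module.
import torch

def attention_decorrelation_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, tokens, tokens) self-attention maps.
    Penalizes off-diagonal correlation between per-token attention rows, reducing token dependencies."""
    rows = attn - attn.mean(dim=-1, keepdim=True)
    rows = rows / (rows.norm(dim=-1, keepdim=True) + 1e-8)
    corr = rows @ rows.transpose(-1, -2)                                   # token-token correlation matrix
    off_diag = corr - torch.diag_embed(torch.diagonal(corr, dim1=-2, dim2=-1))
    return (off_diag ** 2).mean()
```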
Poster
Wenqi Ouyang · Zeqi Xiao · Danni Yang · Yifan Zhou · Shuai Yang · Lei Yang · Jianlou Si · Xingang Pan

[ Exhibit Hall I ]

Abstract
Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations.
Poster
Haoming Cai · Tsung-Wei Huang · Shiv Gehlot · Brandon Feng · Sachin Shah · Guan-Ming Su · Christopher Metzler

[ Exhibit Hall I ]

Abstract
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training—no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
Poster
Yadong Qu · Shancheng Fang · Yuxin Wang · Xiaorui Wang · Zhineng Chen · Hongtao Xie · Yongdong Zhang

[ Exhibit Hall I ]

Abstract
Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate editable multimodal layers from natural language instructions alone. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLMs to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. Notably, IGD is the first method to combine creativity with the ability to generate editable multimodal layers. The superior experimental results demonstrate that IGD offers a new solution for graphic design.
Poster
Mengchen Zhang · Tong Wu · Jing Tan · Ziwei Liu · Gordon Wetzstein · Dahua Lin

[ Exhibit Hall I ]

Abstract
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific movements, interaction with the scene, and directorial intent. Building on this comprehensive and diverse database, we train an auto-regressive, decoder-only Transformer, named GenDoP, for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs. Extensive experiments demonstrate that, compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our code and data …
Poster
Tianrun Xu · Guanyu Chen · Ye Li · Xi Yuxin · Zeyu Mu · Ruichen Wang · Tianren Zhang · Haichuan Gao · Feng Chen

[ Exhibit Hall I ]

Abstract
Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotations or closed-source models with optimal performance, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model’s own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a 1.4M-image dataset. We further utilize this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform various vision-related tasks. Our OURO model, fine-tuned on Qwen2-VL-7B-Instruct using LoRA, achieves substantial improvements over both the base model and similarly-sized counterparts across multiple multimodal benchmarks. The results demonstrate the effectiveness of our method in advancing scene understanding and multimodal reasoning. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets will be released upon acceptance.
Poster
Yu-Ju Tsai · Brian Price · Qing Liu · Luis Figueroa · Daniil Pakhomov · Zhihong Ding · Scott Cohen · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques.
Poster
Xingjian Leng · Jaskirat Singh · Yunzhong Hou · Zhenchang Xing · Saining Xie · Liang Zheng

[ Exhibit Hall I ]

Abstract
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, causing the VAE to converge to trivial solutions and degrading final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the encoder and the diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over $17\times$ and $45\times$ compared to the REPA and vanilla training recipes, respectively. Interestingly, we observe that once tuned via end-to-end training, the VAE can be reused for downstream generation tasks, exhibiting significantly accelerated generation performance across diverse diffusion architectures and training settings.
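A minimal sketch of a joint training step in the spirit described above: the diffusion loss is combined with a representation-alignment term against a frozen visual encoder, and the VAE latents are not detached so both networks receive gradients. `vae`, `diffusion`, `frozen_encoder`, and `proj` are assumed interfaces, and the corruption schedule is deliberately simplified.

```python
# Hedged sketch: end-to-end VAE + diffusion step with a representation-alignment term.
import torch
import torch.nn.functional as F

def joint_training_step(vae, diffusion, frozen_encoder, proj, x, lam: float = 0.5) -> torch.Tensor:
    z = vae.encode(x)                                      # latents are NOT detached: end-to-end gradients
    t = torch.rand(z.shape[0], device=z.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise                          # simplified linear corruption for illustration
    eps_pred, feats = diffusion(z_t, t.flatten(), return_features=True)   # assumed interface
    diff_loss = F.mse_loss(eps_pred, noise)
    with torch.no_grad():
        target = frozen_encoder(x)                         # e.g., patch features from a frozen ViT
    align_loss = 1 - F.cosine_similarity(proj(feats), target, dim=-1).mean()
    return diff_loss + lam * align_loss                    # both VAE and diffusion model are updated
```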
Poster
Xuan-Hao Liu · Bao-liang Lu · Wei-Long Zheng

[ Exhibit Hall I ]

Abstract
Generating high fidelity video from brain activity is an important milestone in brain decoding research. Previous works were mostly based on functional Magnetic Resonance Imaging (fMRI), whose low temporal resolution limits the ability to faithfully reflect rapid brain activity, motivating us to turn to high temporal resolution brain signals like electroencephalography (EEG). However, EEG-to-video is challenging due to the complexity and nonstationarity of EEG signals and the scarcity of data annotations. Addressing these issues, we present **EEGMirror**. Firstly, we adopt neural quantization for converting nonstationary raw EEG signals into robust discrete representations. Afterwards, a masked self-supervision method with montage-agnostic position embedding (MAPE) is introduced. By MAPE, EEGMirror can process EEG data with various montages (number and position of channels) and thus can flexibly leverage different EEG datasets to acquire an effective EEG encoder, mitigating the lack of well-annotated EEG data. Next, multimodal contrastive learning is applied to align the brain modality with dynamic changes and semantic information. Lastly, a fine-tuned inflated Stable Diffusion model is adopted to reconstruct video stimuli guided by visual and semantic information decoded from EEG signals. We show that EEGMirror outperforms the state-of-the-art performance in both semantic (82.1\% vs 79.8\%) and pixel (0.261 vs 0.256) levels. An …
Poster
shangwen zhu · Han Zhang · Zhantao Yang · Qianyu Peng · Zhao Pu · Huangji Wang · Fan Cheng

[ Exhibit Hall I ]

Abstract
Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship between the outputs of the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel $\textbf{training-free}$ acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Because it makes no specific assumptions about the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of $\mathbf{1.67\times}$ in Stable Diffusion v2 and a speedup of $\mathbf{1.55\times}$ in video generation models. When combined with distillation …
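To illustrate the general idea of reusing adjacent-step information, here is a training-free sketch that skips some network evaluations by linearly extrapolating cached model outputs. This is a generic approximation under assumed `model` and `scheduler_step` interfaces, not the LTC-Accel estimator.

```python
# Hedged sketch: skip network calls on some steps via linear extrapolation of cached outputs.
import torch

def sample_with_extrapolation(model, scheduler_step, x, timesteps, skip_every: int = 2):
    prev_out, prev_prev_out = None, None
    for i, t in enumerate(timesteps):
        if prev_out is not None and prev_prev_out is not None and i % skip_every == 1:
            out = 2 * prev_out - prev_prev_out           # linear extrapolation of the two latest real outputs
        else:
            out = model(x, t)                            # real network evaluation (assumed interface)
            prev_prev_out, prev_out = prev_out, out      # cache only real evaluations
        x = scheduler_step(out, t, x)                    # assumed scheduler update (e.g., DDIM step)
    return x
```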
Poster
Zerui Gong · Zhonghua Wu · Qingyi Tao · Qinyue Li · Chen Change Loy

[ Exhibit Hall I ]

Abstract
Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose the Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator using content-style cross-attention to produce a context map. This context map enables spatially adaptive adjustments, allowing our 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches, while maintaining real-time performance at 16 FPS for video stylization. Code and benchmark will be released.
Poster
Jingyi Lu · Kai Han

[ Exhibit Hall I ]

Abstract
Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512×512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance.
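A minimal sketch of turning a drag into standard inpainting inputs: forward-warp the selected region by the drag vector and mark the exposed pixels as the inpainting mask. The bidirectional warping used by Inpaint4Drag is more involved; the function below is only an illustrative simplification.

```python
# Hedged sketch: convert a drag (dx, dy) on a masked region into inpainting inputs.
import numpy as np

def drag_to_inpaint_inputs(image: np.ndarray, region_mask: np.ndarray, dx: int, dy: int):
    """image: (H, W, 3) uint8; region_mask: (H, W) 0/1 selection.
    Returns (warped_image, inpaint_mask) ready for any off-the-shelf inpainting model."""
    warped = image.copy()
    moved_mask = np.zeros_like(region_mask)
    ys, xs = np.nonzero(region_mask)
    new_ys = np.clip(ys + dy, 0, image.shape[0] - 1)
    new_xs = np.clip(xs + dx, 0, image.shape[1] - 1)
    warped[new_ys, new_xs] = image[ys, xs]                 # forward-warp the dragged pixels
    moved_mask[new_ys, new_xs] = 1
    # Pixels that were part of the original region but are no longer covered must be inpainted.
    inpaint_mask = np.logical_and(region_mask.astype(bool), ~moved_mask.astype(bool))
    return warped, inpaint_mask.astype(np.uint8)

# Usage: warped, mask = drag_to_inpaint_inputs(img, sel_mask, dx=30, dy=0)
```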
Poster
Songhua Liu · Ruonan Yu · Xinchao Wang

[ Exhibit Hall I ]

Abstract
Given a source image, personalized text-to-image generation produces images preserving the identity and appearance while following the text prompts. Existing methods heavily rely on test-time optimization to achieve this customization. Although some recent works are dedicated to zero-shot personalization, they still require re-training when applied to different text-to-image diffusion models. In this paper, we instead propose a model-agnostic personalized method termed UniversalBooth. At the heart of our approach lies a novel cross-attention mechanism, where different blocks in the same diffusion scale share common square mappings for key and value, which decouples the image feature encoder from the diffusion architecture while maintaining its effectiveness. Moreover, the cross-attention performs hierarchically: the holistic attention first captures the global semantics of user inputs for textual combination with editing prompts, and the fine-grained attention divides the holistic attention scores for various local patches to enhance appearance consistency. To improve the performance when deployed on unseen diffusion models, we further devise an optimal transport prior to the model and encourage the attention scores allocated by cross-attention to fulfill the optimal transport constraint. Experiments demonstrate that our personalized generation model can be generalized to unseen text-to-image diffusion models with a wide spectrum of architectures and functionalities without …
Poster
Haoxuan Wang · Jinlong Peng · Qingdong He · Hao Yang · Ying Jin · Jiafu Wu · Xiaobin Hu · Yanjie Pan · Zhenye Gan · Mingmin Chi · Bo Peng · Yabiao Wang

[ Exhibit Hall I ]

Abstract
With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance. Our code and dataset will be released soon.
Poster
Yuanrui Wang · Cong Han · Yafei Li · Zhipeng Jin · Xiawei Li · Sinan Du · Wen Tao · Yi Yang · shuanglong li · Chun Yuan · LIU LIN

[ Exhibit Hall I ]

Abstract
Text-to-image generation has transformed content creation, yet precise visual text rendering remains challenging for generative models due to blurred glyphs, semantic inconsistencies, and limited style controllability. Current methods typically employ pre-rendered glyph images as conditional inputs, but their inability to preserve original font styles and color information forces reliance on multi-branch architectures to compensate for missing details. This leads to increased model complexity, higher computational costs, and reduced reusability. To address these limitations, we propose a segmentation-guided framework that leverages pixel-level visual text segmentation masks—complete representations preserving glyph shapes, colors, and spatial details—as unified conditional inputs. Our approach integrates two key innovations: (1) a fine-tuned bilingual segmentation model for extracting precise text masks from source images, and (2) a streamlined diffusion model enhanced with adaptive glyph condition and glyph region loss to ensure semantic and stylistic fidelity. On the AnyText-benchmark, our method achieves a sentence accuracy (Sen.Acc) of 0.8267 and a Normalized Edit Distance (NED) of 0.8976 for Chinese text generation, while the English test set delivers even stronger performance with 0.9018 Sen.Acc and 0.9582 NED, surpassing prior methods by substantial margins. To address broader evaluation needs, we introduce two novel benchmarks: GlyphMM-benchmark (for holistic glyph consistency assessment) and MiniText-benchmark (targeting …
Poster
Sherry Chen · Yi Wei · Luowei Zhou · Suren Kumar

[ Exhibit Hall I ]

Abstract
Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach and scorer for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0706 (+17.48%) gain in score correlation with human ratings on AURORA-Bench and improving pair-wise comparison accuracy by 3.48% (+6.22%) on GenAI-Bench and 1.57% (+3.09%) on AURORA-Bench compared to the state-of-the-art. It can also enhance image editing models as a reward model, boosting the average evaluation score of edit outputs with respect to ImagenHub from 6.15 to 6.67 (+8.46%). Our code and dataset will be released upon acceptance.
Poster
Xiang Lv · Mingwen Shao · Lingzhuang Meng · Chang Liu · Yecong Wan · Xinyuan Chen

[ Exhibit Hall I ]

Abstract
Recently, text-driven diffusion models have significantly promoted the development of video editing. However, there still remain two practical challenges: (1) existing text-to-video editing methods struggle to understand negative text prompts, resulting in ineffective suppression of undesirable content in the edited video; (2) these methods struggle to maintain the temporal consistency of the edited video, leading to inter-frame flickering. To address the above challenges, we propose SUV, a novel semantic modulation method based on text embeddings to suppress undesired content in the edited video. Specifically, on the one hand, we discover that the end embeddings (EE) contain substantial coupled positive and negative embeddings, which is the primary reason for the appearance of undesirable content in the edited video. Based on this discovery, we advocate decoupling the negative embeddings from the EE by employing singular value decomposition and propose an exponential suppression operator to decrease the singular values of negative embeddings, thereby restraining the effect of negative embeddings on the edited video content. Subsequently, two constraints are designed to further suppress negative content while keeping positive content unchanged via pushing negative embeddings apart and pulling positive embeddings closer. On the other hand, to boost the temporal consistency of edited video, we devise …
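The SVD-based suppression step lends itself to a rough sketch. The snippet below is only an illustration, not the authors' implementation: which trailing singular directions correspond to "negative" content, the value of `neg_rank`, and the exponential decay schedule are all assumptions.

```python
import torch

def suppress_negative_directions(end_embeddings: torch.Tensor,
                                 neg_rank: int = 4,
                                 alpha: float = 2.0) -> torch.Tensor:
    """Decompose end embeddings and exponentially shrink singular values assumed
    to carry the undesired (negative) content.

    end_embeddings: (num_tokens, dim) text-encoder end embeddings (assumed layout).
    neg_rank: number of trailing singular directions treated as 'negative' (hypothetical).
    alpha: strength of the exponential suppression (hypothetical).
    """
    U, S, Vh = torch.linalg.svd(end_embeddings, full_matrices=False)
    S_mod = S.clone()
    decay = torch.exp(-alpha * torch.arange(1, neg_rank + 1).float())
    S_mod[-neg_rank:] = S[-neg_rank:] * decay          # exponential suppression operator
    return U @ torch.diag(S_mod) @ Vh

emb = torch.randn(77, 768)                             # toy CLIP-like embedding matrix
emb_clean = suppress_negative_directions(emb)
```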
Poster
Haonan Wang · Qixiang ZHANG · Lehan Wang · Xuanqi Huang · Xiaomeng Li

[ Exhibit Hall I ]

Abstract
Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. The code will be released upon acceptance.
Poster
Teng-Fang Hsiao · Bo-Kai Ruan · Yi-Lun Wu · Tzu-Ling Lin · Hong-Han Shuai

[ Exhibit Hall I ]

Abstract
Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking—this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
Poster
Wang Ziye · Minghang Yu · Chunyan Xu · Zhen Cui

[ Exhibit Hall I ]

Abstract
With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned concepts of pre-trained models are critical for identifying forged images. However, misalignment between the forgery and concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction techniques to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision-language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery and concepts. A concept-level forgery discrepancy learning module, based on reconstruction, enhances the interaction between concepts and forgeries, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forged feature enhancement integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods.
Poster
Shyamgopal Karthik · Huseyin Coskun · Zeynep Akata · Sergey Tulyakov · Jian Ren · Anil Kag

[ Exhibit Hall I ]

Abstract
Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset "Syn-Pic" improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.
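Conceptually, the synthetic preference collection reduces to scoring each prompt's candidate images with pre-trained reward models and recording the induced ranking. A minimal sketch under that reading follows; `reward_models` and the simple averaging rule are stand-ins, not the paper's exact recipe.

```python
from typing import Callable, List, Sequence

def rank_candidates(prompt: str,
                    images: Sequence,                      # generated candidates for the prompt
                    reward_models: List[Callable]) -> List[int]:
    """Return candidate indices ordered from most to least preferred.

    Preferences come from pre-trained reward functions instead of human raters;
    averaging predictions across several models makes the ranking less noisy.
    """
    scores = []
    for img in images:
        s = sum(rm(prompt, img) for rm in reward_models) / len(reward_models)
        scores.append(s)
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)

# usage sketch with a dummy reward model (hypothetical)
dummy_rm = lambda prompt, img: (hash(img) % 100) / 100.0
order = rank_candidates("a red bicycle", ["img_a", "img_b", "img_c"], [dummy_rm])
print(order)   # e.g. [2, 0, 1]: a ranked-preference record usable for ranking-based DPO training
```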
Poster
Yilei Jiang · Wei-Hong Li · Yiyuan Zhang · Minghong Cai · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias present in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods for debiasing DMs usually require model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets incurs expensive annotation costs; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such a reference dataset. Specifically, FairGen consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for re-training. …
Poster
Habin Lim · Youngseob Won · Juwon Seo · Gyeong-Moon Park

[ Exhibit Hall I ]

Abstract
In recent years, multi-concept personalization for text-to-image (T2I) diffusion models to represent several subjects in an image has gained increasing attention. The main challenge of this task is “concept mixing”, where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference.
Poster
Qihang Yu · Ju He · Xueqing Deng · Xiaohui Shen · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made publicly available.
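The annealed permutation schedule is concrete enough to sketch. Below is a minimal illustration, not the released code: the permutation probability r decays linearly from 1 to 0, and with probability r the raster token order is shuffled; tensor shapes and the handling of positional information for the permuted order are assumptions.

```python
import torch

def permutation_prob(step: int, total_steps: int) -> float:
    """Probability r of permuting the token order: starts at 1, decays linearly to 0."""
    return max(0.0, 1.0 - step / total_steps)

def maybe_permute(tokens: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Randomly permute the factorization order of a token sequence with probability r.

    tokens: (batch, seq_len) discrete image tokens in raster order (assumed shape).
    Feeding the permuted position information to the model is omitted here.
    """
    r = permutation_prob(step, total_steps)
    if torch.rand(()) < r:
        perm = torch.randperm(tokens.shape[1])
        tokens = tokens[:, perm]
    return tokens

# usage sketch with a hypothetical codebook size and sequence length
tokens = torch.randint(0, 1024, (2, 256))
shuffled = maybe_permute(tokens, step=100, total_steps=10_000)
```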
Poster
Dongwon Kim · Ju He · Qihang Yu · Chenglin Yang · Xiaohui Shen · Suha Kwak · Liang-Chieh (Jay) Chen

[ Exhibit Hall I ]

Abstract
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce **T**ext-**A**ware **T**ransformer-based 1-D**i**mensional **Tok**enizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image **Mask**ed **Gen**erative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
Poster
Yitian Zhang · Long Mai · Aniruddha Mahapatra · David Bourgin · Yicong Hong · Jonah Casebeer · Feng Liu · Yun Fu

[ Exhibit Hall I ]

Abstract
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32× (8× higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
Poster
Wenda SHI · Yiren Song · Dengming Zhang · Jiaming Liu · XINGXING ZOU

[ Exhibit Hall I ]

Abstract
Visual text rendering is widespread in various real-world applications, requiring careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models shows promise in automating these processes. However, these methods still encounter challenges like inconsistent fonts, style variation, and limited fine-grained control, particularly at the word-level. This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method (on 5% key parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporate HTML-render into the data synthesis pipeline and propose the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.
Poster
Xiangxiang Chu · Renda Li · Yong Wang

[ Exhibit Hall I ]

Abstract
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available.
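As a rough reading of masked latent modeling in a VAE latent space, one can mask a random subset of patchified latent tokens and regress the masked entries. The masking ratio, MSE loss, and model interface below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_latent_modeling_loss(model, vae_latent: torch.Tensor, mask_ratio: float = 0.75):
    """vae_latent: (B, N, D) patchified VAE latents. `model` is assumed to predict all
    patches from the visible ones plus mask tokens (interface is hypothetical)."""
    B, N, D = vae_latent.shape
    n_mask = int(mask_ratio * N)
    mask = torch.zeros(B, N, dtype=torch.bool)
    for b in range(B):                                     # independent random mask per sample
        mask[b, torch.randperm(N)[:n_mask]] = True
    pred = model(vae_latent, mask)                         # (B, N, D) reconstruction
    return F.mse_loss(pred[mask], vae_latent[mask])        # loss only on masked patches

# dummy model that echoes its input, just to show the call pattern
dummy = lambda z, m: z.clone()
loss = masked_latent_modeling_loss(dummy, torch.randn(2, 196, 16))
```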
Poster
Hui Zhang · Dexiang Hong · Yitong Wang · Jie Shao · Xinglong Wu · Zuxuan Wu · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has been devoted to Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box …
Poster
Yachun Mi · Yu Li · Weicheng Meng · Chaofeng Chen · Chen Hui · Shaohui Liu

[ Exhibit Hall I ]

Abstract
The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, lightweight Convolutional Neural Networks (CNNs) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves comparable performance to state-of-the-art methods …
Poster
Dongyeun Lee · jiwan hur · Hyounguk Shon · Jae Young Lee · Junmo Kim

[ Exhibit Hall I ]

Abstract
Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing PTQ techniques, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model …
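Channel-wise power-of-two scaling admits a compact sketch: each activation channel gets a scale that is an exact power of two, chosen here from the channel's maximum magnitude over a calibration batch. This is only an illustration of the general technique; the paper's learned equivalent scaling, timestep weighting, and voting over calibration samples are omitted.

```python
import torch

def pow2_scales(calib_acts: torch.Tensor) -> torch.Tensor:
    """calib_acts: (N, C) calibration activations; returns one power-of-two scale per channel."""
    max_abs = calib_acts.abs().amax(dim=0).clamp(min=1e-8)
    return 2.0 ** torch.ceil(torch.log2(max_abs))          # round scale up to a power of two

def fake_quantize(acts: torch.Tensor, scales: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric fake-quantization of activations with the given per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    q = torch.round(acts / scales * qmax).clamp(-qmax, qmax)
    return q / qmax * scales

# toy activations with high inter-channel variance, as described for skip connections
calib = torch.randn(512, 64) * torch.linspace(0.1, 5.0, 64)
scales = pow2_scales(calib)
x = torch.randn(8, 64) * torch.linspace(0.1, 5.0, 64)
x_q = fake_quantize(x, scales)                             # W4A6-style activation quantization sketch
```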
Poster
Zizhuo Li · Yifan Lu · Linfeng Tang · Shihua Zhang · Jiayi Ma

[ Exhibit Hall I ]

Abstract
This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores that are dynamically estimated, thereby ensuring computational efficiency while improving the representational capacity of aggregated tokens simultaneously. Secondly, considering that feature interaction with massive non-covisible areas is distracting, which may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant message broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant tokens rather than all of them. Thirdly, we find that at the fine-level stage, current methods adjust only the target view's keypoints to subpixel level, while those in the source view remain restricted at the coarse level and are thus not informative enough, detrimental to keypoint location-sensitive usages. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level, attaining attractive performance improvement. Thorough experimentation across an array of public benchmarks affirms CoMatch’s promising accuracy, efficiency, …
Poster
Dat Cong · Hieu Tran · Hoang Thanh-Tung

[ Exhibit Hall I ]

Abstract
Diffusion models have gained prominence as state-of-the-art techniques for synthesizing images and videos, particularly due to their ability to scale effectively with large datasets. Recent studies have uncovered that these extensive datasets often contain mistakes from manual labeling processes. However, the extent to which such errors compromise the generative capabilities and controllability of diffusion models is not well studied. This paper introduces Score-based Discriminator Correction (SBDC), a guidance technique for aligning noisy pre-trained conditional diffusion models. The guidance is built on discriminator training using adversarial loss, drawing on prior noise detection techniques to assess the authenticity of each sample. We further show that limiting the usage of our guidance to the early phase of the generation process leads to better performance. Our method is computationally efficient, only marginally increases inference time, and does not require retraining diffusion models. Experiments on different noise settings demonstrate the superiority of our method over previous state-of-the-art methods.
Poster
Junyi Guo · Jingxuan Zhang · Fangyu Wu · Huanda Lu · Qiufeng Wang · Wenmian Yang · ENG Gee LIM · Dongming Lu

[ Exhibit Hall I ]

Abstract
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
Poster
Xiaomeng Fu · Jia Li

[ Exhibit Hall I ]

Abstract
Diffusion models have achieved remarkable success in image and video generation due to their powerful generative capabilities. However, they suffer from slow inference speed and high computational costs. Existing acceleration methods for diffusion models may compromise model performance and struggle to generalize across diverse diffusion model architectures and downstream tasks. To address these issues, we propose a model-agnostic and highly scalable acceleration strategy for text-controlled image generation. Specifically, we dynamically modulate the text guidance coefficient and truncate redundant text-related computations during the denoising process. Experimental results demonstrate that our approach achieves significant model acceleration while preserving precise text-image alignment, showcasing the potential for a wide range of diffusion models and downstream applications.
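One plausible reading of "modulating the text guidance coefficient and truncating redundant text-related computations" is a time-dependent classifier-free guidance scale whose text-conditioned branch is skipped once the scale becomes negligible. The sketch below is an assumption-laden illustration, not the paper's method; the linear schedule and the `skip_below` threshold are invented for clarity.

```python
import torch

def guidance_scale(step: int, total_steps: int, w_max: float = 7.5) -> float:
    """Illustrative schedule (assumption): strong text guidance early, decaying toward the end."""
    return w_max * (1.0 - step / total_steps)

def guided_noise_pred(unet, x_t, t, text_emb, null_emb, step, total_steps, skip_below=0.5):
    w = guidance_scale(step, total_steps)
    eps_uncond = unet(x_t, t, null_emb)
    if w < skip_below:
        return eps_uncond                       # skip the text-conditioned pass when its weight is negligible
    eps_text = unet(x_t, t, text_emb)
    return eps_uncond + w * (eps_text - eps_uncond)

# usage sketch with a dummy "unet" standing in for a real denoiser
dummy_unet = lambda x, t, c: x * 0.0
out = guided_noise_pred(dummy_unet, torch.randn(1, 4, 64, 64), 10,
                        text_emb=None, null_emb=None, step=40, total_steps=50)
```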
Poster
Yujie Zhang · Bingyang Cui · Qi Yang · Zhu Li · Yiling Xu

[ Exhibit Hall I ]

Abstract
Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions; ii) previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation.
Poster
Wei Xu · Kangjie Chen · Jiawei Qiu · Yuyang zhang · Run Wang · Jin Mao · Tianwei Zhang · Lina Wang

[ Exhibit Hall I ]

Abstract
Text-to-image models have achieved remarkable progress in generating high-quality images from textual prompts, yet their potential for misuse like generating unsafe content remains a critical concern. Existing safety mechanisms, such as filtering and fine-tuning, remain insufficient in preventing vulnerabilities exposed by adversarial prompts. To systematically evaluate these weaknesses, we propose an automated red-teaming framework, Feedback-Guided Prompt Iteration (FGPI), which utilizes a Vision-Language Model (VLM) as the red-teaming agent following a feedback-guide-rewrite paradigm for iterative prompt optimization. The red-teaming VLM analyzes prompt-image pairs based on evaluation results, provides feedback and modification strategies to enhance adversarial effectiveness while preserving safety constraints, and iteratively improves prompts. To enable this functionality, we construct a multi-turn conversational VQA dataset with over 6,000 instances, covering seven attack types and facilitating the fine-tuning of the red-teaming VLM. Extensive experiments demonstrate the effectiveness of our approach, achieving over 90\% attack success rate within five iterations while maintaining prompt stealthiness and safety. The experiments also validate the adaptability, diversity, transferability, and explainability of FGPI. The source code and dataset are available at (*URL omitted for double-blind reviewing; code available in supplementary materials*).
Poster
Seunggwan Lee · Hwanhee Jung · ByoungSoo Koh · Qixing Huang · Sang Yoon · Sangpil Kim

[ Exhibit Hall I ]

Abstract
A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, a flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net, which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.
Poster
Yuqing Wang · Zhijie Lin · Yao Teng · Yuanzhi Zhu · Shuhuai Ren · Jiashi Feng · Xihui Liu

[ Exhibit Hall I ]

Abstract
Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling.
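Dimension-wise post-training quantization can be sketched directly: each feature dimension of the continuous tokens is discretized independently into uniform bins. The bin count and the min/max range estimation below are illustrative choices, not the paper's settings.

```python
import torch

def fit_dim_ranges(latents: torch.Tensor):
    """latents: (N, D) continuous tokens; returns per-dimension min/max over a calibration set."""
    return latents.min(dim=0).values, latents.max(dim=0).values

def quantize_per_dim(latents, lo, hi, n_bins=16):
    """Independently discretize each feature dimension into n_bins uniform levels."""
    scale = (hi - lo).clamp(min=1e-8) / (n_bins - 1)
    return ((latents - lo) / scale).round().clamp(0, n_bins - 1).long()   # discrete tokens, (N, D)

def dequantize_per_dim(idx, lo, hi, n_bins=16):
    scale = (hi - lo).clamp(min=1e-8) / (n_bins - 1)
    return idx.float() * scale + lo

x = torch.randn(1024, 8)                         # toy continuous tokens
lo, hi = fit_dim_ranges(x)
codes = quantize_per_dim(x, lo, hi)              # what the lightweight AR head would predict per dimension
x_hat = dequantize_per_dim(codes, lo, hi)
print((x - x_hat).abs().mean())                  # small reconstruction error from quantization
```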
Poster
Yuanzhi Zhu · Xi WANG · Stéphane Lathuilière · Vicky Kalogeiton

[ Exhibit Hall I ]

Abstract
Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator.Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an `on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
Poster
Trevor Canham · SaiKiran Tedla · Michael Murdoch · Michael Brown

[ Exhibit Hall I ]

Abstract
While most images shared on the web and social media platforms are encoded in standard dynamic range (SDR), many displays now can accommodate high dynamic range (HDR) content. Additionally, modern cameras can capture images in an HDR format but convert them to SDR to ensure maximum compatibility with existing workflows and legacy displays. To support both SDR and HDR, new encoding formats are emerging that store additional metadata in SDR images in the form of a gain map. When applied to the SDR image, the gain map recovers the HDR version of the image as needed. These gain maps, however, are typically down-sampled and encoded using standard image compression, such as JPEG and HEIC, which can result in unwanted artifacts. In this paper, we propose to use a lightweight multi-layer perceptron (MLP) network to encode the gain map. The MLP is optimized using the SDR image information as input and provides superior performance in terms of HDR reconstruction. Moreover, the MLP-based approach uses a fixed memory footprint (10 KB) and requires no additional adjustments to accommodate different image sizes or encoding parameters. We conduct extensive experiments on various MLP based HDR embedding strategies and demonstrate that our approach outperforms the …
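The core idea, fitting a tiny MLP to reproduce an image's gain map from SDR information, can be illustrated as follows. The input features (SDR RGB plus normalized pixel coordinates), layer sizes, and plain MSE objective are assumptions for this sketch; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class GainMapMLP(nn.Module):
    """Tiny MLP that predicts the HDR gain value at a pixel from SDR information (sketch)."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),   # inputs: SDR r,g,b + normalized x,y (assumed)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

# toy optimization loop: overfit the MLP to one image's gain map
sdr = torch.rand(64, 64, 3)
gain = torch.rand(64, 64, 1)                               # ground-truth gain map (toy data)
ys, xs = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")
inputs = torch.cat([sdr, xs[..., None], ys[..., None]], dim=-1).reshape(-1, 5)
targets = gain.reshape(-1, 1)

model = GainMapMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    opt.step()
# the learned weights play the role of the fixed-size metadata stored alongside the SDR image
```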
Poster
Tu Bui · Shruti Agarwal · John Collomosse

[ Exhibit Hall I ]

Abstract
Imperceptible digital watermarking is important in copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark, a watermarking method that leverages a spatio-spectral loss function and a 1x1 convolution layer to enhance encoding quality. TrustMark is robust against both in-place and out-of-place perturbations while maintaining image quality above 43 dB. Additionally, we propose TrustMark-RM, a watermark removal method designed for re-watermarking, along with a simple yet effective algorithm that enables both TrustMark and TrustMark-RM to operate seamlessly across arbitrary resolutions. Our methods achieve state-of-the-art performance on 3 benchmarks. Models and code are released under the MIT license and an anonymized version is included for review.
Poster
Hongyang Wei · Shuaizheng Liu · Chun Yuan · Lei Zhang

[ Exhibit Hall I ]

Abstract
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-$k$ sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for …
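An entropy-based Top-k rule can be sketched as choosing k per step from the entropy of the next-image-token distribution; the specific mapping from entropy to k below is an assumption made purely for illustration.

```python
import torch

def entropy_topk_sample(logits: torch.Tensor, k_min: int = 50, k_max: int = 2000) -> torch.Tensor:
    """Sample one image token, with k adapted to the entropy of the distribution.

    logits: (vocab,) next-image-token logits. Higher entropy (less certain local structure)
    is mapped to a larger k here; the paper's exact rule may differ.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    frac = (entropy / max_entropy).clamp(0.0, 1.0)
    k = int(k_min + frac * (k_max - k_min))
    topk_probs, topk_idx = probs.topk(k)
    choice = torch.multinomial(topk_probs / topk_probs.sum(), 1)
    return topk_idx[choice]

token = entropy_topk_sample(torch.randn(8192))   # toy vocabulary of image tokens
```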
Poster
Ling Lo · Kelvin Chan · Wen-Huang Cheng · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In contrast, we extend the model to generate smooth and consistent attribute transitions by introducing frame-wise guidance for the video latent during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions.
Poster
Youwei Zheng · Yuxi Ren · Xin Xia · Xuefeng Xiao · Xiaohua Xie

[ Exhibit Hall I ]

Abstract
Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5\%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60\% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
Poster
Fangfu Liu · Hanyang Wang · Yimo Cai · Kaiyan Zhang · Xiaohang Zhan · Yueqi Duan

[ Exhibit Hall I ]

Abstract
With the increasing scale of training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt. In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the searching process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively …
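The "intuitive linear search strategy" amounts to best-of-N sampling under a test-time verifier. A minimal sketch with placeholder generate and verifier callables, both hypothetical:

```python
from typing import Callable, Sequence

def linear_search(prompt: str,
                  noise_seeds: Sequence[int],
                  generate: Callable[[str, int], object],
                  verifier: Callable[[str, object], float]):
    """Spend more inference compute by trying several noise candidates and
    returning the video the verifier scores highest."""
    best_video, best_score = None, float("-inf")
    for seed in noise_seeds:
        video = generate(prompt, seed)          # full denoising run from this noise sample
        score = verifier(prompt, video)         # test-time feedback signal
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score

# usage sketch with dummy stand-ins for the generator and verifier
gen = lambda p, s: f"video_from_seed_{s}"
ver = lambda p, v: float(v.endswith("3"))
print(linear_search("a cat surfing", range(4), gen, ver))
```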
Poster
shaojin wu · Mengqi Huang · wenxu wu · Yufeng Cheng · Fei Ding · Qian HE

[ Exhibit Hall I ]

Abstract
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to address this challenge. It leverages the intrinsic in-context generation capabilities of diffusion transformers. Additionally, we introduce $UNO$, which consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
Poster
Shengrong Yuan · Runmin Wang · Ke Hao · Xu-Qi Ma · Changxin Gao · Li Liu · Nong Sang

[ Exhibit Hall I ]

Abstract
Scene text image super-resolution (STISR) focuses on enhancing the clarity and readability of low-resolution text images. Existing methods often rely on text probability distribution priors derived from text recognizers to guide the super-resolution process. While effective in capturing general structural information of text, these priors lack the ability to preserve specific text style details, such as font, stereoscopic effect and spatial transformation, leading to a loss of visual quality and stylistic consistency in the super-resolved images. To address these limitations, we propose a Style embedding-based scene text image Super-Resolution Network (StyleSRN), which introduces a text style embedding mechanism to preserve and enhance text style features during the super-resolution process. The proposed architecture includes a Style Enhancement Block for capturing multi-scale cross-channel dependencies, and a Style Content Fusion Block that effectively integrates text content with style information, ensuring that the structure and style of the restored text are not distorted. Furthermore, we introduce a Text Style Loss based on the Gram matrix to supervise the reconstruction process at the style level, thereby maintaining the stylistic consistency of the restored text images. Extensive experiments on the TextZoom dataset and five scene text recognition benchmarks demonstrate the superiority of our method. The code …
Poster
Wei Chen · Jingxi Yu · Zichen Miao · Qiang Qiu

[ Exhibit Hall I ]

Abstract
Large pre-trained transformers have revolutionized artificial intelligence across various domains, and fine-tuning remains the dominant approach for adapting these models to downstream tasks due to the cost of training from scratch. However, in existing fine-tuning methods, the updated representations are formed as a dense combination of modified parameters, making it challenging to interpret their contributions and understand how the model adapts to new tasks. In this work, we introduce a fine-tuning framework inspired by sparse coding, where fine-tuned features are represented as a sparse combination of basic elements, i.e., feature dictionary atoms. Sparse coefficients then serve as indicators of atom importance, identifying the contribution of each atom to the updated representation. The feature dictionary atoms function as fundamental building blocks of the representation, and tuning atoms allows for seamless adaptation to downstream tasks. Leveraging the atom selection capability of sparse coefficients, we first demonstrate that our method enhances image editing performance by improving text alignment through the removal of unimportant feature dictionary atoms. Additionally, we validate the effectiveness of our approach in the text-to-image concept customization task, where our method efficiently constructs the target concept using a sparse combination of feature dictionary atoms, outperforming various baseline fine-tuning methods.
Poster
Xinli Xu · Wenhang Ge · Jiantao Lin · Jiawei Feng · Lie XU · hanfeng Zhao · Shunsi Zhang · Ying-Cong Chen

[ Exhibit Hall I ]

Abstract
In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, or a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate 3D-aware text annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information with spatial relationships. By integrating the control signal with the proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts. Additionally, users can influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced multiple controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality.
Poster
Yatai Ji · Jiacheng Zhang · Jie Wu · Shilong Zhang · Shoufa Chen · Chongjian GE · Peize Sun · Weifeng Chen · Wenqi Shao · Xuefeng Xiao · Weilin Huang · Ping Luo

[ Exhibit Hall I ]

Abstract
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then, multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
Poster
Delong Zhang · Qiwei Huang · Yang Sun · Yuanliu Liu · Wei-Shi Zheng · Pengfei Xiong · Wei Zhang

[ Exhibit Hall I ]

Abstract
Diffusion-based virtual try-on aims to synthesize a realistic image that seamlessly integrates the specific garment into a target model. The primary challenge lies in effectively guiding the warping process of the diffusion model. However, previous methods either lack direct guidance or explicitly warp the garment image, which highly depends on the performance of the warping module. In this paper, we propose FIA-VTON, which leverages the \textbf{implicit} flow feature as guidance by adopting a Flow Infused Attention module on virtual try-on. The dense warp flow map is projected as indirect guidance attention to enhance the feature map warping in the generation process implicitly, which is less sensitive to the warping estimation accuracy than an explicit warp of the garment image. To further enhance implicit warp guidance, we incorporate high-level spatial attention to complement the dense warp. Experimental results on the VTON-HD and DressCode datasets show that FIA-VTON significantly outperforms state-of-the-art methods, demonstrating that it is effective and robust for virtual try-on.
Poster
Ziyin Zhou · Yunpeng Luo · Yuanchen Wu · Ke Sun · Jiayi Ji · Ke Yan · Shouhong Ding · Xiaoshuai Sun · Yunsheng Wu · Rongrong Ji

[ Exhibit Hall I ]

Abstract
The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on …
Poster
Sung Ju Lee · Nam Ik Cho

[ Exhibit Hall I ]

Abstract
Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID scores. In conclusion, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Our code will be publicly available soon.
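Hermitian symmetry in the 2-D Fourier domain guarantees that the corresponding spatial pattern is real-valued, which is one way to read "frequency integrity". The numpy sketch below only shows the symmetrization of an arbitrary complex watermark pattern; the latent-space embedding pipeline and the center-aware strategy are not shown.

```python
import numpy as np

def hermitian_symmetrize(F: np.ndarray) -> np.ndarray:
    """Project a complex 2-D frequency pattern onto the Hermitian-symmetric set,
    i.e. enforce F[k1, k2] == conj(F[-k1 % H, -k2 % W])."""
    flipped = np.roll(np.flip(F, axis=(0, 1)), shift=(1, 1), axis=(0, 1))
    return 0.5 * (F + np.conj(flipped))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) + 1j * rng.normal(size=(64, 64))   # arbitrary watermark pattern
W_sym = hermitian_symmetrize(W)
spatial = np.fft.ifft2(W_sym)
print(np.abs(spatial.imag).max())   # ~1e-16: the spatial-domain watermark is real-valued
```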
Poster
Tianwei Xiong · Jun Hao Liew · Zilong Huang · Jiashi Feng · Xihui Liu

[ Exhibit Hall I ]

Abstract
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality—a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
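Semantic regularization, as described, aligns tokenizer features with features from a frozen pre-trained visual encoder. A minimal sketch of such an alignment term; the cosine-similarity loss and the linear projection head are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def semantic_regularization(tok_feats: torch.Tensor,
                            teacher_feats: torch.Tensor,
                            proj: torch.nn.Module) -> torch.Tensor:
    """tok_feats: (B, N, D_tok) tokenizer features; teacher_feats: (B, N, D_teacher)
    features from a frozen pre-trained visual encoder at matching positions.
    Returns a loss that pulls projected tokenizer features toward the teacher's."""
    z = proj(tok_feats)                                    # map to the teacher's feature dimension
    return 1.0 - F.cosine_similarity(z, teacher_feats, dim=-1).mean()

proj = torch.nn.Linear(256, 768)                           # illustrative dimensions
loss = semantic_regularization(torch.randn(2, 196, 256),
                               torch.randn(2, 196, 768), proj)
# total tokenizer loss (sketch): reconstruction + quantization terms + lambda * semantic regularization
```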
Poster
Francesco Taioli · Edoardo Zorzi · Gianni Franchi · Alberto Castellini · Alessandro Farinelli · Marco Cristani · Yiming Wang

[ Exhibit Hall I ]

Abstract
Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free, open-ended dialogues with humans. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA …
Poster
Duong T. Tran · Trung-Kien Tran · Manfred Hauswirth · Danh Le-Phuoc

[ Exhibit Hall I ]

Abstract
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
Poster
Achint Soni · Meet Soni · Sirisha Rambhatla

[ Exhibit Hall I ]

Abstract
Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and background fidelity. Existing methods utilize masks derived from cross-attention maps generated by diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks.
Poster
Mang Cao · Sanping Zhou · Yizhe Li · Ye Deng · Wenli Huang · Le Wang

[ Exhibit Hall I ]

Abstract
Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face a trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scan mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scan modes within a unified linear-complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.
Poster
Trong Bang Nguyen · Phi Le Nguyen · Simon Lucey · Minh Hoai

[ Exhibit Hall I ]

Abstract
Data attribution in text-to-image generative models is a crucial yet underexplored problem, particularly at the regional level, where identifying the most influential training regions for generated content can enhance transparency, copyright protection, and error diagnosis. Existing data attribution methods either operate at the whole-image level or lack scalability for large-scale generative models. In this work, we propose a novel framework for region-level data attribution. At its core is the Attribution Region (AR) detector, which localizes influential regions in training images used by the text-to-image generative model. To support this research, we construct a large-scale synthetic dataset with ground-truth region-level attributions, enabling both training and evaluation of our method. Empirical results show that our method outperforms existing attribution techniques in accurately tracing generated content back to training data. Additionally, we demonstrate practical applications, including identifying artifacts in generated images and suggesting improved replacements for generated content. Our dataset and framework will be released to advance further research in region-level data attribution for generative models.
Poster
Yuqi Li · Haotian Zhang · Li Li · Dong Liu

[ Exhibit Hall I ]

Abstract
Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step to effectively exploit diverse contextual information. Experimental results demonstrate our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity.
Poster
Joowon Kim · Ziseok Lee · Donghyeon Cho · Sanghyun Jo · Yeonsung Jung · Kyungsu Kim · Eunho Yang

[ Exhibit Hall I ]

Abstract
Despite recent advances in diffusion models, achieving reliable image generation and editing results remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. In particular, instruction-guided image editing with diffusion models offers user-friendly editing capabilities, yet editing failures, such as background distortion, frequently occur across different attempts. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting their applicability, and evaluating multiple seeds increases computational complexity, reducing practicality. To address this, we first establish a new multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while fully preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when …
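A hedged sketch of the seed-ranking step suggested by the abstract above: score each candidate seed by how much its early-timestep preview deviates from the source inside a background mask, then keep the seed with the lowest mismatch. The MSE score, the mask construction, and the preview tensors are placeholders, not the paper's exact metric.

```python
import numpy as np

def rank_seeds_by_background(source, previews, bg_mask):
    """source: (H, W) reference image/latent; previews: dict seed -> (H, W)
    early-timestep preview; bg_mask: (H, W) boolean background mask."""
    scores = {seed: float(np.mean((p[bg_mask] - source[bg_mask]) ** 2))
              for seed, p in previews.items()}
    return sorted(scores, key=scores.get)            # lowest background mismatch first

rng = np.random.default_rng(1)
mask = np.zeros((32, 32), dtype=bool)
mask[:, :16] = True                                  # left half treated as background
src = rng.random((32, 32))
cands = {s: src + 0.02 * s * rng.random((32, 32)) for s in range(4)}
best_seed = rank_seeds_by_background(src, cands, mask)[0]   # -> seed 0 in this toy case
```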
Poster
Tong Wei · Yijun Yang · Junliang Xing · Yuanchun Shi · Zongqing Lu · Deheng Ye

[ Exhibit Hall I ]

Abstract
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.
Poster
Jialu Gao · Joseph K J · Fernando De la Torre

[ Exhibit Hall I ]

Abstract
The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and poses of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat them as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject’s identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.
Poster
Aniruddha Bala · Rohit Chowdhury · Rohan Jaiswal · Siddharth Roheda

[ Exhibit Hall I ]

Abstract
Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.
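An illustrative sketch, not the paper's optimization, of perturbing an image directly in DCT space: take the DCT, apply a small signed step on the coefficients, and transform back to pixels. The random "gradient" stands in for the gradient of whatever editing-disruption objective is actually used, and the budget value is arbitrary.

```python
import numpy as np
from scipy.fft import dctn, idctn

def perturb_in_dct(image, budget=2.0, seed=0):
    """image: (H, W) grayscale array in [0, 255]. The random 'gradient' below is
    a placeholder for the gradient of a real attack objective."""
    coeffs = dctn(image, norm="ortho")
    grad = np.random.default_rng(seed).standard_normal(coeffs.shape)
    coeffs_adv = coeffs + budget * np.sign(grad)        # FGSM-style step in DCT space
    return np.clip(idctn(coeffs_adv, norm="ortho"), 0.0, 255.0)

protected = perturb_in_dct(np.full((256, 256), 128.0))
```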
Poster
Junlong Tong · Wei Zhang · Yaohui Jin · Xiaoyu Shen

[ Exhibit Hall I ]

Abstract
Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context into entropy models often relies on intricate model designs, increasing complexity and computational costs. Furthermore, entropy models employing autoregressive or checkerboard strategies fail to model the significance of spatial context order, potentially limiting the availability of relevant contextual information during decoding. To address these issues, we propose the context guided transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal and importance-weighted spatial contexts. The temporal context resampler learns predefined latent queries and utilizes transformer encoders to fuse the resampled critical information while reducing subsequent computational overhead. Subsequently, we design a teacher-student network to explicitly model the importance of spatial context order. During training, the teacher network generates an attention map (i.e., importance scores) and an entropy map (i.e., confidence scores) from randomly masked inputs, guiding the student network to select the top-k weighted decoding tokens as subsequent contextual information. During inference, only the student network is employed, utilizing high-importance and high-confidence tokens to guide the prediction of the remaining undecoded tokens. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65\% and lowers the BD …
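A toy sketch of the top-k context-selection step described above: combine the teacher's importance (attention) map and confidence (entropy-derived) map and keep the k highest-weighted token positions as the first decoding set. The multiplicative combination and the map shapes are assumptions made purely for illustration.

```python
import numpy as np

def select_topk_tokens(importance, confidence, k):
    """importance, confidence: (H, W) maps from the teacher network.
    Returns the (row, col) coordinates of the k highest-weighted tokens."""
    score = (importance * confidence).ravel()
    top = np.argsort(score)[::-1][:k]
    return np.stack(np.unravel_index(top, importance.shape), axis=-1)

rng = np.random.default_rng(0)
coords = select_topk_tokens(rng.random((16, 16)), rng.random((16, 16)), k=32)  # (32, 2)
```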
Poster
Jinghao Wang · Zhang Li · Zi Wang · Banglei Guan · Yang Shang · Qifeng Yu

[ Exhibit Hall I ]

Abstract
Recently, 6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases, and 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling, providing compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
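A minimal split-conformal sketch of the calibration idea, assuming per-keypoint predictions of the form (mean, sigma): nonconformity scores on a calibration split yield a multiplier q, and the 2D confidence region for a new keypoint is the disk of radius q times sigma. Variable names and the score definition are illustrative, not the paper's exact construction.

```python
import numpy as np

def conformal_multiplier(cal_errors, cal_sigmas, alpha=0.1):
    """cal_errors: (N,) distances between predicted and true keypoints on a
    calibration split; cal_sigmas: (N,) predicted Gaussian scales. Returns q such
    that the disk of radius q * sigma covers >= 1 - alpha of new keypoints."""
    scores = np.asarray(cal_errors) / np.asarray(cal_sigmas)   # nonconformity scores
    n = scores.size
    rank = min(int(np.ceil((n + 1) * (1 - alpha))), n)         # finite-sample quantile index
    return np.sort(scores)[rank - 1]

rng = np.random.default_rng(0)
q = conformal_multiplier(np.abs(rng.normal(size=500)) * 1.3, np.ones(500), alpha=0.1)
# for a new keypoint (mu, sigma): confidence region = {x : ||x - mu|| <= q * sigma}
```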
Poster
Richard Liu · Daniel Fu · Noah Tan · Itai Lang · Rana Hanocka

[ Exhibit Hall I ]

Abstract
In this work we present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
Poster
Enis Simsar · Alessio Tonioni · Yongqin Xian · Thomas Hofmann · Federico Tombari

[ Exhibit Hall I ]

Abstract
We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited images, and edit instructions. These triplets are typically generated either by existing editing methods, which introduces biases, or through human annotations, which are costly and limit generalization. Our approach addresses these challenges by introducing a novel editing mechanism called Edit Reversibility Constraint (ERC), which applies forward and reverse edits in one training step and enforces alignment in image, text, and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-instruction triplets. We empirically show that our approach performs better across a broader range of edits with high fidelity and precision. By eliminating the need for pre-existing datasets of triplets, reducing biases associated with current methods, and proposing ERC, our work represents a significant advancement in unlocking the scaling of instruction-based image editing.
Poster
Zeyu Liu · Zanlin Ni · Yeguo Hua · Xin Deng · Xiao Ma · Cheng Zhong · Gao Huang

[ Exhibit Hall I ]

Abstract
Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with 6$\times$ less training budget compared to standard VQGAN, our approach achieves a remarkable codebook utilization of 100\% and notable reconstruction FID (rFID) of 0.43 and 1.34 for 8$\times$ and 16$\times$ compression respectively.
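For context, the "discretization" half that the abstract layers on top of a frozen continuous VAE can be reduced, in its barest form, to a nearest-neighbor codebook lookup. The PyTorch sketch below (arbitrary codebook size and feature dimension) shows only that elementary step, not CODA's carefully designed adaptation process.

```python
import torch

def quantize(latents, codebook):
    """latents: (B, N, D) continuous VAE features; codebook: (K, D) code vectors.
    Returns quantized features and their discrete token indices."""
    dists = torch.cdist(latents, codebook.unsqueeze(0).expand(latents.size(0), -1, -1))
    indices = dists.argmin(dim=-1)                     # (B, N) discrete tokens
    return codebook[indices], indices

codes, ids = quantize(torch.randn(2, 16, 8), torch.randn(512, 8))
```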
Poster
Yiyang Wang · Xi Chen · Xiaogang Xu · Sihui Ji · Yu Liu · Yujun Shen · Hengshuang Zhao

[ Exhibit Hall I ]

Abstract
In spite of recent progress, image diffusion models still produce artifacts. A common solution is to leverage the feedback provided by quality assessment systems or human annotators to optimize the model, where images are generally rated in their entirety. In this work, we believe problem-solving starts with identification, which requires that the model be aware of not just the presence of defects in an image, but also their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to optimize the diffusion model by providing pixel-level feedback. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
Poster
Ruidong Chen · honglin guo · Lanjun Wang · Chenyu Zhang · Weizhi Nie · Anan Liu

[ Exhibit Hall I ]

Abstract
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods have been studied to enable the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. First, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying an effective mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the …
Poster
Hengrui Kang · Siwei Wen · Zichen Wen · Junyan Ye · Weijia Li · Peilin Feng · Baichuan Zhou · Bin Wang · Dahua Lin · Linfeng Zhang · Conghui He

[ Exhibit Hall I ]

Abstract
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, while current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31\% in mIoU and 7.75\% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. …
Poster
Zheng-Peng Duan · jiawei zhang · Xin Jin · Ziheng Zhang · Zheng Xiong · Dongqing Zou · Jimmy Ren · Chun-Le Guo · Chongyi Li

[ Exhibit Hall I ]

Abstract
Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of the diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. The code will be available to the community.
Poster
Mian Zou · Nan Zhong · Baosheng Yu · Yibing Zhan · Kede Ma

[ Exhibit Hall I ]

Abstract
Supervised learning has been the dominant approach for developing detectors of AI-generated face images. However, the reliance on pre-generated face samples often limits the adaptability to the diverse and rapidly evolving landscape of AI face generators. Here, we propose a bi-level optimization framework for self-supervised AI-generated face detection, relying solely on photographic images and aligning the pretext tasks with the downstream AI face detection. The inner loop optimization aims to train a feature extractor using linearly weighted objectives of several pretext tasks, including classifying categorical exchangeable image file format (EXIF) tags, ranking ordinal EXIF tags, and identifying global and local face manipulations. The outer loop optimization treats the coarse-grained detection of face manipulations as a surrogate task for AI-generated image detection, directing the feature extractor to adapt to detecting AI faces by optimizing the linear weightings to align the task relationships. To validate the effectiveness of our self-supervised features, we first frame AI-generated face detection as one-class classification, and model the feature distribution of photographic images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Additionally, we train a two-layer perceptron based on the extracted self-supervised features as a simple binary classifier. We demonstrate by comprehensive …
Poster
Zhong-Yu Li · Ruoyi Du · Juncheng Yan · Le Zhuo · Zhen Li · Peng Gao · Zhanyu Ma · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
Recent advances in diffusion models have significantly advanced image generation; however, existing models remain task-specific, limiting their efficiency and generalizability. While universal models attempt to address these limitations, they face critical challenges, including generalizable instruction design, appropriate task distributions, and unified architectural design. In this work, we propose VisualCloze, a universal image generation framework, to tackle these challenges. Unlike existing methods that rely on language-based task descriptions, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and knowledge transfer. Furthermore, we uncover an intrinsic alignment between image infilling and in-context learning, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures. Experiments demonstrate that VisualCloze achieves strong performance across more than 100 in-domain tasks while generalizing to unseen tasks in few-shot and zero-shot settings.
Poster
Zexi Jia · Chuanwei Huang · Hongyan Fei · Yeshuang Zhu · Zhiqiang Yuan · Ying Deng · Jiapei Zhang · Jinchao Zhang · Jie Zhou

[ Exhibit Hall I ]

Abstract
Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.
Poster
Wenshuo Gao · Xicheng Lan · Shuai Yang

[ Exhibit Hall I ]

Abstract
Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce AnyPortal, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. AnyPortal is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that AnyPortal achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.
Poster
Yefei He · Yuanyu He · Shaoxuan He · Feng Chen · Hong Zhou · Kaipeng Zhang · Bohan Zhuang

[ Exhibit Hall I ]

Abstract
Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet 256$\times$256 and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the …
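The near-to-far decoding order implied by "next-neighbor prediction" can be made concrete with a short schedule builder: token positions of a grid are grouped by Manhattan distance from a starting token, and each group can be predicted in parallel. This is a plain-Python illustration of the ordering only, under the assumption of a single 2D grid; it does not model the dimension-oriented decoding heads.

```python
from collections import defaultdict

def neighbor_decoding_schedule(h, w, start=(0, 0)):
    """Group positions of an h x w token grid by Manhattan distance from the
    starting token; tokens within a group share a decoding step."""
    groups = defaultdict(list)
    for i in range(h):
        for j in range(w):
            groups[abs(i - start[0]) + abs(j - start[1])].append((i, j))
    return [groups[d] for d in sorted(groups)]

schedule = neighbor_decoding_schedule(4, 4)
# schedule[0] == [(0, 0)]; schedule[1] == [(0, 1), (1, 0)]; ...
```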
Poster
Hang Guo · Yawei Li · Taolin Zhang · Jiangshan Wang · Tao Dai · Shu-Tao Xia · Luca Benini

[ Exhibit Hall I ]

Abstract
Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, leading to complexity and runtime that scale dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step, where most tokens have already converged. Leveraging this observation, we develop a cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speed up FlashAttention-accelerated VAR by 2.7x with a negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher-resolution images. In particular, FastVAR can generate a 2K image with a 15GB memory footprint in 1.5s on a single NVIDIA 3090 GPU.
Poster
ying ba · Tianyu Zhang · Yalong Bai · Wenyi Mo · Tao Liang · Bing Su · Ji-Rong Wen

[ Exhibit Hall I ]

Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, which achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality, aiming to enhance image quality in aspects such as aesthetics and detail refinement while maintaining the achieved text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10\% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides a theoretical foundation and empirical support for the evolution of image generation technology toward better alignment with higher-order human aesthetic preferences.
Poster
Yian Zhao · rushi ye · Ruochong Zheng · Zesen Cheng · Chaoran Feng · Jiashu Yang · Pengchong Qiao · Chang Liu · Jie Chen

[ Exhibit Hall I ]

Abstract
3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed Tune-Your-Style, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized …
Poster
Tiancheng SHEN · Jun Hao Liew · Zilong Huang · Xiangtai Li · Zhijie Lin · Jiyang Liu · Yitong Wang · Jiashi Feng · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Multimodal Diffusion Transformers (MM-DiTs) have recently emerged as a powerful framework for unified text-vision synthesis, surpassing traditional U-Net architectures in generative tasks. One key innovation lies in their Multimodal Self-Attention (MM-SA) interaction, where image and text tokens are concatenated and processed via self-attention. However, this mechanism poses significant challenges for editing, rendering conventional U-Net-based attention manipulation methods ineffective. To address this limitation, we propose QK-Edit, a training-free framework that exploits the unique attention dynamics of MM-DiTs for precise text-guided image and video editing. By introducing a novel query-key manipulation strategy, our method isolates and adjusts critical attention components to achieve an optimal balance between prompt fidelity and structural consistency. This enables seamless edits across various tasks, including object addition, object removal, object replacement, changing background, changing material, changing color, and style transformation. Notably, it can be easily implemented with feature replacement at inference. QK-Edit demonstrates superior editing performance on state-of-the-art models, such as FLUX and HunyuanVideo, effectively bridging the gap between generative power and editing flexibility in MM-DiTs, and paving the way for scalable multimodal content creation. The code will be made publicly available.
Poster
Gang Dai · Yifan Zhang · Yutao Qin · Qiangya Guo · Shuangping Huang · Shuicheng YAN

[ Exhibit Hall I ]

Abstract
Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns, encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Our source code will be made publicly available.
Poster
hongji yang · Wencheng Han · Yucheng Zhou · Jianbing Shen

[ Exhibit Hall I ]

Abstract
In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate image generation control. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability to perform element- or region-specific control. This limitation reduces flexibility and can cause condition misunderstandings in multi-conditional image generation. To address these challenges, we propose both intra-element and inter-element Controllers in DC-ControlNet. The Intra-Element Controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of the object. For interactions between elements, we introduce the Inter-Element Controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and Layout-to-Image generative models in terms of control flexibility and precision in multi-condition control.
Poster
Ruotong Wang · Mingli Zhu · Jiarong Ou · Rui Chen · Xin Tao · Pengfei Wan · Baoyuan Wu

[ Exhibit Hall I ]

Abstract
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and …
Poster
Hailong Guo · Bohan Zeng · Yiren Song · Wentao Zhang · Jiaming Liu · Chuang Zhang

[ Exhibit Hall I ]

Abstract
Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person’s image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with …
Poster
Wenzhuang Wang · Yifan Zhao · Mingcan Ma · Ming Liu · Zhonglin Jiang · Yong Chen · Jia Li

[ Exhibit Hall I ]

Abstract
Layout-to-image (L2I) generation has exhibited promising results in natural image generation, but it faces challenges and even fails when applied to degraded scenarios (i.e., low-light, underwater). This is primarily attributed to the "contextual illusion dilemma" within degraded contexts, where foreground instances are overwhelmed by context-dominant frequency distributions. Motivated by this, our paper proposes a new Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, which seeks to transfer frequency-aware knowledge (i.e., edges, textures) into the latent diffusion space, thereby better rendering the degraded instances via frequency-aware guidance. To be specific, FICGen consists of two major steps. First, we introduce a learnable dual-query mechanism, each query paired with an individual frequency resampler, to perceive contextual frequency prototypes disentangled from degraded images. Subsequently, a visual-frequency enhanced attention is employed to incorporate the frequency knowledge within these prototypes into the degraded instance generation process. Second, to alleviate attribute leakage and compensate for sample loss in dense and small objects, we propose an instance coherence map to regulate instance isolation, coupled with an adaptive spatial-frequency aggregation module to merge them in a spatial-frequency mixed manner. Extensive quantitative and qualitative experiments against L2I methods on four benchmarks illustrate the superior quality and trainability of FICGen under diverse degradation circumstances.
Poster
Jathushan Rajasegaran · Ilija Radosavovic · Rahul Ravishankar · Yossi Gandelsman · Christoph Feichtenhofer · Jitendra Malik

[ Exhibit Hall I ]

Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate.
Poster
Etai Sella · Noam Atia · Ron Mokady · Hadar Averbuch-Elor

[ Exhibit Hall I ]

Abstract
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to encourage identity preservation also within the locally edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and adherence to the textual description. We will release our code and trained models.
Poster
Hanshen Zhu · Zhen Zhu · Kaile Zhang · Yiming Gong · Yuliang Liu · Xiang Bai

[ Exhibit Hall I ]

Abstract
We tackle the problem of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, which proves difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. We will release our code and benchmark when the paper becomes publicly available.
Poster
Seunghyun Shin · Dongmin Shin · Jisu Shin · Hae-Gon Jeon · Joon-Young Lee

[ Exhibit Hall I ]

Abstract
Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is to explicitly generate a look-up table (LUT) for color attribute alignment between reference scenes and the input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes, such as look, mood, and emotion, should be similar to those of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames, as well as fast inference. We further build a pipeline to incorporate user preferences via text prompts for low-level feature enhancement, such as contrast and brightness. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. To validate its robustness, we provide our source code and video demo as supplementary materials.
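As a simplified illustration of the LUT-based formulation above, the sketch below applies a 3D color LUT to a frame with a nearest-neighbor lookup; real grading pipelines interpolate trilinearly, and the identity cube here is purely for demonstration, not a learned LUT from the paper.

```python
import numpy as np

def apply_lut(frame, lut):
    """frame: (H, W, 3) floats in [0, 1]; lut: (S, S, S, 3) color cube.
    Nearest-neighbor lookup for brevity; production code would interpolate."""
    s = lut.shape[0]
    idx = np.clip(np.rint(frame * (s - 1)).astype(int), 0, s - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]

size = 17
grid = np.linspace(0.0, 1.0, size)
identity_lut = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
graded = apply_lut(np.random.rand(8, 8, 3), identity_lut)   # identity LUT: near no-op grade
```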
Poster
Do Dat · Nam Hyeon-Woo · Po-Yuan Mao · Tae-Hyun Oh

[ Exhibit Hall I ]

Abstract
Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better image quality, as evaluated by humans, and improved robustness as the number of binding pairs in the prompt grows.
Poster
Junfei Xiao · Feng Cheng · Lu Qi · Liangke Gui · Yang Zhao · Shanchuan Lin · Jiepeng Cen · Zhibei Ma · Alan Yuille · Lu Jiang

[ Exhibit Hall I ]

Abstract
Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Codes and data will be made publicly available.
Poster
Jiwoo Chung · Sangeek Hyun · Hyunjun Kim · Eunseo Koh · Minkyu Lee · Jae-Pil Heo

[ Exhibit Hall I ]

Abstract
Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pre-trained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual Auto-Regressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naively fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we find that the early stages have a greater influence on the generation of the subject than the later stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions, encouraging the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
Poster
En Ci · Shanyan Guan · Yanhao Ge · Yilin Zhang · Wei Li · Zhenyu Zhang · Jian Yang · Ying Tai

[ Exhibit Hall I ]

Abstract
Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based methods introduce reconstruction errors and inefficiencies, while instruction-based models suffer from limited datasets, architectural constraints, and high computational costs. We propose DescriptiveEdit, a description-driven editing framework that preserves the generative power of pre-trained T2I models without architectural modifications or inversion. A Cross-Attentive UNet with an attention bridge enables direct feature fusion, while LoRA-based tuning ensures efficiency and compatibility. Without retraining, DescriptiveEdit seamlessly integrates with ControlNet, IP-Adapter, and other extensions. Experiments show it improves editing accuracy and consistency while significantly reducing computational costs, providing a scalable and flexible solution for text-guided image manipulation.
Poster
Zheng Gao · Jifei Song · Zhensong Zhang · Jiankang Deng · Ioannis Patras

[ Exhibit Hall I ]

Abstract
Current training-free text-driven image translation primarily uses diffusion features (convolution and attention) of a pre-trained model as guidance to preserve the style/structure of the source image in the translated image. However, this coarse feature-level guidance struggles with style (e.g., visual patterns) and structure (e.g., edges) alignment with the source. Based on the observation that the low-/high-frequency components retain the style/structure information of an image, in this work we propose training-free Frequency-Guided Diffusion (FGD), which tailors low-/high-frequency guidance for style- and structure-guided translation, respectively. For low-frequency guidance (style-guided), we substitute the low-frequency components of the diffusion latents from the sampling process with those from the inversion of the source image and normalize the obtained latent with the composited spectrum to enforce color alignment. For high-frequency guidance (structure-guided), we propose high-frequency alignment and high-frequency injection, which complement each other. High-frequency alignment preserves edges and contours by adjusting the predicted noise with a guidance function that aligns high-frequency image regions between the sampled and source images. High-frequency injection facilitates layout preservation by injecting high-frequency components of diffusion convolution features (from inversion) into the sampling process. Qualitative and quantitative results verify the superiority of our method on style- and structure-guided translation tasks. We make the code publicly available at: withheld during review.
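The low-/high-frequency split described above can be illustrated with a generic FFT-based band decomposition. This is a minimal sketch, not the authors' implementation; the tensor shapes and the `cutoff` radius are illustrative assumptions.

```python
import torch

def split_frequency_bands(latent: torch.Tensor, cutoff: float = 0.15):
    """Split a (B, C, H, W) latent into low- and high-frequency parts.

    Generic FFT-based band split; `cutoff` is a hypothetical radius
    (fraction of the spectrum), not a value from the paper.
    """
    B, C, H, W = latent.shape
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))

    # Radial mask: True inside the low-frequency disk around the spectrum center.
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij"
    )
    low_mask = (yy ** 2 + xx ** 2).sqrt() <= cutoff

    low_spec = spec * low_mask
    high_spec = spec * (~low_mask)

    to_spatial = lambda s: torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1))).real
    return to_spatial(low_spec), to_spatial(high_spec)

# Example: keep the low band of an inverted source latent (style guidance)
# and the high band of the current sampling latent.
z_sample = torch.randn(1, 4, 64, 64)
z_source = torch.randn(1, 4, 64, 64)
low_src, _ = split_frequency_bands(z_source)
_, high_smp = split_frequency_bands(z_sample)
z_guided = low_src + high_smp
```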
Poster
Ming Li · Xin Gu · Fan Chen · Xiaoying Xing · Longyin Wen · Chen Chen · Sijie Zhu

[ Exhibit Hall I ]

Abstract
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a …
Poster
Rongkun Xue · Jinouwen Zhang · Yazhe Niu · Dazhong Shen · Bingqi Ma · Yu Liu · Jing Yang

[ Exhibit Hall I ]

Abstract
Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous generation model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. This framework enables the flexible selection of feature hierarchies tailored to specific downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model based methods, including 78% top-1 accuracy on ImageNet at a resolution of 64×64. Extensive ablation studies, including out-of-distribution evaluations, further validate the effectiveness of our approach.
Poster
Naifu Xue · Zhaoyang Jia · Jiahao Li · Bin Li · Yuan Zhang · Yan Lu

[ Exhibit Hall I ]

Abstract
Recent studies in extreme image compression have achieved remarkable performance by compressing the tokens from generative tokenizers. However, these methods often prioritize clustering common semantics within the dataset, while overlooking the diverse details of individual objects. Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. DLF decomposes the latent into semantic and detail elements, compressing them through two distinct branches. The semantic branch clusters high-level information into compact tokens, while the detail branch encodes perceptually critical details to enhance the overall fidelity. Additionally, we propose a cross-branch interactive design to reduce redundancy between the two branches, thereby minimizing the overall bit cost. Experimental results demonstrate the impressive reconstruction quality of DLF even below 0.01 bits per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore, DLF surpasses recent diffusion-based codecs in visual fidelity while maintaining a comparable level of generative realism. Code will be available later.
Poster
Rui Tian · Qi Dai · Jianmin Bao · Kai Qiu · Yifan Yang · Chong Luo · Zuxuan Wu · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. Towards this goal, we design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images. This magic Reducio charm enables 64$\times$ reduction of latents compared to a common 2D VAE, without sacrificing the quality. Building upon Reducio-VAE, we can train diffusion models for high-resolution video generation efficiently. Specifically, we adopt a two-stage generation paradigm, first generating a condition image via text-to-image generation, followed by text-image-to-video generation with the proposed Reducio-DiT. Extensive experiments show that our model achieves strong performance in evaluation. More importantly, our method significantly boosts the training and inference efficiency of video LDMs. Reducio-DiT is trained in just 3.2K A100 GPU hours in total and can generate a 16-frame 1024$\times$1024 video clip within 15.5 seconds on a single A100 GPU.
Poster
Yichen Lu · Siwei Nie · Minlong Lu · Xudong Yang · Xiaobo Zhang · Peng Zhang

[ Exhibit Hall I ]

Abstract
Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace - a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7\% $\mu$AP / 83.9\% RP90 for matcher, 72.6\% $\mu$AP / 68.4\% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods. Code is available.
Poster
Haodong Jing · Dongyao Jiang · Yongqiang Ma · Haibo Hua · Bo Huang · Nanning Zheng

[ Exhibit Hall I ]

Abstract
Decoding visual information from fMRI signals is an important pathway to understand how the brain represents the world, and is a cutting-edge field of artificial general intelligence. Decoding fMRI should not be limited to reconstructing visual stimuli, but should also extend to transforming them into descriptions, creating actions, and even generating unseen content. We propose a novel and efficient brain multimodal architecture, NeuroCreat, which combines the powerful visual and textual abilities of LLMs to capture fine-grained semantic information from fMRI and transform it into an embodied implementation of different neural representations. Specifically, we designed a brain expert adaptation (BEA) module that effectively captures commonalities and individual differences among subjects through the collaborative learning of shared/routed experts. Inspired by human visual working memory, we extracted "creation" information from the higher visual cortex for idea generation. We further constructed a prompt variant alignment module that seamlessly integrates fMRI-visual-semantic-creation into the LLM to achieve flexible incorporation of different semantics in the decoding of neural representations. Experiments on different fMRI datasets show that NeuroCreat achieves SOTA performance on multiple brain decoding tasks. More importantly, we have achieved few-shot brain video creation, which opens up a new direction for demonstrating the brain's "imaginative" ability.
Poster
Marcos Conde · Zihao Lu · Radu Timofte

[ Exhibit Hall I ]

Abstract
Text-guided image generation and editing is emerging as a fundamental problem in computer vision. However, most approaches lack control, and the generated results are far from professional photography quality standards. In this work, we propose the first approach that introduces language and explicit control into the image processing and editing pipeline. PixTalk is a vision-language multi-task image processing model, guided using text instructions. Our method is able to perform over 40 transformations (the most popular techniques in photography), delivering results on par with professional photography editing software. Our model can process 12MP images on consumer GPUs in real-time (under 1 second). As part of this effort, we propose a novel dataset and benchmark for new research on multi-modal image processing and editing.
Poster
KA WONG · Jicheng Zhou · Haiwei Wu · Yain-Whar Si · Jiantao Zhou

[ Exhibit Hall I ]

Abstract
The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for reliable forgery detection. Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document backgrounds and structured texts. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces' sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-background disparities. Furthermore, noticing the predominantly pristine nature of background regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79\% averaged over 5 types of distortions. The code is available in supplementary …
Poster
Wenkui Yang · Jie Cao · Junxian Duan · Ran He

[ Exhibit Hall I ]

Abstract
Diffusion models like Stable Diffusion have become prominent in visual synthesis tasks due to their powerful customization capabilities. However, these capabilities also introduce significant security risks, such as deepfakes and copyright infringement. To mitigate these risks, a class of methods known as protective perturbation emerged, which prevents image misuse by injecting imperceptible adversarial noise. On the other hand, purification methods can effectively remove the protective perturbation, thereby exposing images again to the risk of malicious forgery. In this work, we formalize the anti-purification task, highlighting the challenges that existing approaches cannot address properly, and propose a solution named **AntiPure**. AntiPure is robust against the "purification-customization" workflow, owing to two types of proposed guidance: 1) Patch-wise Frequency Guidance, which reduces the model's influence over high-frequency components in the purified image, and 2) Erroneous Timestep Guidance, which disrupts the model's denoising strategy across different timesteps. With additional guidance, AntiPure embeds imperceptible perturbation patterns resistant to purification, achieving effective output distortion after customization. Experiments show that our approach achieves minimal perceptual discrepancy, maximal distortion, and robust performance, outperforming current protective perturbation methods within the purification-customization workflow.
Poster
Xiaoyi Feng · Tao Huang · Peng Wang · Zizhou Huang · Haihang Zhang · Yuntao Zou · Dagang Li · Kaifeng Zou

[ Exhibit Hall I ]

Abstract
Line drawing colorization is a critical step in the cel-animation industry, where artists use a paint bucket tool to apply RGB values to segments based on a character’s color design sheet. Current automated methods predominantly focus on consecutive frame colorization, using a single adjacent frame as a reference. These approaches often face two major challenges: inaccurate segment colorization due to significant deformations between the target and reference frames, and incomplete information in a single frame that prevents finding suitable reference segments, leading to poor color accuracy. To address these challenges, we propose a novel colorization framework that integrates both temporal and structural information. Using multiple reference keyframes, our method effectively captures temporal information across frames, enhancing the accuracy of colorization for transitional frames. In addition, we leverage structural information through a matching-based approach that ensures precise segment alignment across frames. This combination of temporal awareness through multi-frame references and structural alignment improves colorization robustness, even in scenarios with large motion and deformations. Our method outperforms existing techniques, demonstrating superior colorization accuracy and consistency in industrial cel-animation workflows.
Poster
Jiawei Wang · Zhiming Cui · Changjian Li

[ Exhibit Hall I ]

Abstract
This paper presents VQ-SGen, a novel algorithm for high-quality creative sketch generation. Recent approaches have framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in stage one, we decouple each stroke's shape and location information to ensure the VQ representation prioritizes stroke shape learning. In stage two, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as text or class label conditioned generation and sketch completion. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques on the CreativeSketch dataset, underscoring its effectiveness. The code and model will be made publicly available upon publication.
Poster
Bo Zhao · Haoran Wang · Jinghui Wang · Hanzhang Wang · Huan Yang · Wei Ji · Hao Liu · Xinyan Xiao

[ Exhibit Hall I ]

Abstract
In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution paradigm for content-aware layout GenerAtion. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module is leveraged to perform fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the module to enhance its layout planning ability. Moreover, we present a new large-scale poster dataset, namely BIG-Poster with rich meta-information annotation. We conduct extensive experiments and obtain remarkable state-of-the-art performance improvement on multiple benchmark datasets.
Poster
Chen Zhennan · Yajie Li · Haofan Wang · Zhibo Chen · Zhengkai Jiang · Jun Li · Qian Wang · Jian Yang · Ying Tai

[ Exhibit Hall I ]

Abstract
Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus only applicable to specific models, or manipulate score maps within attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAGD, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAGD decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures each regional prompt is properly executed, and the overall detail refinement over regions (Regional Soft Refinement), which dissolves visual boundaries and enhances adjacent interactions. Furthermore, RAGD makes repainting feasible, allowing users to modify specific unsatisfactory regions of the previous generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement of their prompt-following ability. Quantitative and qualitative experiments demonstrate that RAGD achieves superior performance on attribute binding and object relationships compared with previous methods.
Poster
Yuyan Chen · Yifan Jiang · Li Zhou · Jinghan Cao · Yu Guan · Ming Yang · Qingpei Guo

[ Exhibit Hall I ]

Abstract
In recent years, multi-modal large language models (MLLMs) have been successfully adopted to generate humorous and engaging descriptions for internet memes. However, it is challenging to apply the same approaches to ordinary images, which lack inherently funny or exaggerated content. Thus, crafting appealing descriptions for ordinary images demands imaginative effort to discover or create intriguing connections between words and image content. To address this gap, we introduce AppealImage, a large-scale dataset consisting of ordinary images paired with appealing descriptions. AppealImage allows us to define four distinct tasks with quantitative metrics to enable objective evaluation. Subsequently, we propose CharmNet, an innovative framework designed to generate appealing descriptions for ordinary images. CharmNet combines instruction tuning with heuristic active learning, guided by a referee model. Experimental results demonstrate that CharmNet outperforms the state-of-the-art method by 11.4\% in generating appealing descriptions. Furthermore, CharmNet delivers impressive performance across various creative applications, including visual storytelling and situational dialogue generation. These results highlight CharmNet's potential to enhance social media engagement and to empower strong brand presence in competitive markets.
Poster
Xinyu Hou · Zongsheng Yue · Xiaoming Li · Chen Change Loy

[ Exhibit Hall I ]

Abstract
In this work, we show that we only need **a single parameter $\omega$** to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model’s reverse process. This simple approach does not require model retraining or architectural modifications and incurs negligible computational overhead, yet enables precise control over the level of details in the generated outputs. Moreover, spatial masks or denoising schedules with varying $\omega$ values can be applied to achieve region-specific or timestep-specific granularity control. External control signals or reference images can guide the creation of precise $\omega$ masks, allowing targeted granularity adjustments. Despite its simplicity, the method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code will be made publicly available.
Poster
Chen Liu · Tobias Ritschel

[ Exhibit Hall I ]

Abstract
We propose a novel generative video model that robustly learns temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective combining two aspects. The first is to map from past frames directly to future video frames; previous work has mapped noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame instead of noise is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as a joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a baseline conditional diffusion model but with higher speed, i.e., fewer ODE solver steps.
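The noise-augmentation idea, learning to undo accumulated drift by corrupting the previous frame during training, can be sketched generically as below. This only illustrates the idea; `flow_model`, `noise_std`, and the plain MSE objective are assumptions, not the paper's ODE-flow training recipe.

```python
import torch
import torch.nn.functional as F

def drift_robust_step(flow_model, prev_frame, next_frame, noise_std: float = 0.1):
    """Sketch of noise augmentation for frame-to-frame prediction: corrupt the
    previous frame so the learned past-to-future mapping also removes
    accumulated errors at inference time. Values are illustrative."""
    noisy_prev = prev_frame + noise_std * torch.randn_like(prev_frame)
    pred_next = flow_model(noisy_prev)  # map the past frame directly to the next frame
    return F.mse_loss(pred_next, next_frame)
```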
Poster
Moayed Haji-Ali · Willi Menapace · Aliaksandr Siarohin · Ivan Skorokhodov · Alper Canberk · Kwot Sin Lee · Vicente Ordonez · Sergey Tulyakov

[ Exhibit Hall I ]

Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves substantial improvement in audio-video synchronization, outperforming more expensive baselines such as the MovieGen video-to-audio model.
Poster
Jiaqi Han · Haotian Ye · Puheng Li · Minkai Xu · James Zou · Stefano Ermon

[ Exhibit Hall I ]

Abstract
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.
Poster
Chieh-Yun Chen · Min Shi · Gong Zhang · Humphrey Shi

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or often necessitate additional training, restricting their generalization ability. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (multi-modal) large language models (LLMs) to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17\% at only …
Poster
Hengjia Li · Haonan Qiu · Shiwei Zhang · Xiang Wang · Yujie Wei · Zekun Li · Yingya Zhang · Boxi Wu · Deng Cai

[ Exhibit Hall I ]

Abstract
The current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but it is still under-explored in identity-specific human video generation with customized ID images. The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following after identity injection. Current video identity customization methods mainly rely on reconstructing given identity images on text-to-image models, which have a distribution divergent from that of the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed $\textbf{PersonalVideo}$, that applies a mixture of reward supervision on synthesized videos instead of the simple reconstruction objective on images. Specifically, we first incorporate an identity consistency reward to effectively inject the reference's identity without the tuning-inference gap. Then we propose a novel semantic consistency reward to align the semantic distribution of the generated videos with the original T2V model, which preserves its dynamic and semantic following capability during the identity injection. With the non-reconstructive reward training, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image. Extensive experiments …
Poster
Zhixiang Guo · Siyuan Liang · Aishan Liu · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Diffusion models have attracted significant attention due to their exceptional data generation capabilities in fields such as image synthesis. However, recent studies have shown that diffusion models are vulnerable to copyright infringement attacks, where attackers inject strategically modified non-infringing images into the training set, inducing the model to generate infringing content under the prompt of specific poisoned captions. To address this issue, we first propose a defense framework, CopyrightShield, to defend against the above attack. Specifically, we analyze the memorization mechanism of diffusion models and find that attacks exploit the model's overfitting to specific spatial positions and prompts, causing it to reproduce poisoned samples under backdoor triggers. Based on this, we propose a poisoned sample detection method using spatial masking and data attribution to quantify poisoning risk and accurately identify hidden backdoor samples. To further mitigate memorization of poisoned features, we introduce an adaptive optimization strategy that integrates a dynamic penalty term into the training loss, reducing reliance on infringing features while preserving generative performance. Experimental results demonstrate that CopyrightShield significantly improves poisoned sample detection performance across two attack scenarios, achieving average F1-scores of 0.665, retarding the First-Attack Epoch (FAE) by 115.2%, and decreasing the Copyright Infringement Rate (CIR) …
Poster
Zhanzhou Feng · Qingpei Guo · Xinyu Xiao · Ruihan Xu · Ming Yang · Shiliang Zhang

[ Exhibit Hall I ]

Abstract
Existing video generation strategies fall into two categories, i.e., diffusion and autoregressive (AR) methods. While AR methods achieve high efficiency by predicting the next token based on known visual cues, they generally fall short of diffusion models in terms of video quality. To bridge this gap, this paper introduces a novel continuous-domain next-set prediction strategy. Our approach groups related tokens to be generated into a single set and simultaneously predicts their probability distributions, thereby better exploiting their spatial and temporal dependencies. Specifically, we propose two token partitioning strategies: Spatial Progressive Partitioning for image tokens and Temporal Next-Frame Partitioning for video tokens. Additionally, we construct a denoising sampler to generate outputs from the token set distribution within a continuous domain. This method unifies image and video generation under a cohesive next-set prediction framework. Experimental results indicate that our method achieves video quality comparable to recent diffusion models, while significantly reducing inference costs. Notably, our method surpasses the recent next-token prediction approach Emu3 in video quality despite using approximately 90\% fewer parameters. Visualizations further confirm the effectiveness of our method in capturing intricate details and movements, such as water droplets and facial expressions. All implementations will be released.
Poster
Jeremy Styborski · Mingzhi Lyu · Jiayou Lu · Nupur Kapur · Adams Kong

[ Exhibit Hall I ]

Abstract
Poisoning attacks pose significant challenges to the robustness of diffusion models (DMs). In this paper, we systematically analyze when and where poisoning affects textual inversion, a widely used personalization technique for DMs. We first introduce Semantic Sensitivity Maps (SSM), a novel method for visualizing the influence of poisoning on text embeddings. Second, we identify and experimentally verify that DMs exhibit non-uniform learning behavior across timesteps, focusing on lower-noise samples. Poisoning attacks inherit this bias and inject adversarial signals predominantly at lower timesteps. Third, we find that adversarial signals distract DM learning away from relevant regions within training data, ultimately degrading textual inversion quality. Based on these insights, we propose Safe-Zone Training (SZT), a novel defense mechanism comprised of 3 key components: (1) JPEG compression to weaken high-frequency poison signals, (2) restriction to higher timesteps during textual inversion training to avoid adversarial signals at lower timesteps, and (3) loss masking to constrain learning to relevant regions. Extensive experiments across multiple poisoning methods demonstrate that SZT significantly enhances the robustness of textual inversion against all poisoning attacks, improving average DINOv2 similarity across poisons to 0.43, compared to prior published defenses at 0.26. We will publish code and datasets upon acceptance.
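Two of the three SZT components, restricting training to higher (noisier) timesteps and masking the loss to relevant regions, can be sketched in a generic diffusion training step. This assumes a diffusers-style scheduler and UNet interface; `t_min_frac` and the source of `relevance_mask` are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def sz_training_step(unet, x0, text_emb, relevance_mask, scheduler,
                     t_min_frac: float = 0.5):
    """Hedged sketch of a diffusion training step with (a) timesteps drawn only
    from the higher-noise part of the schedule and (b) a spatial loss mask."""
    B = x0.shape[0]
    T = scheduler.config.num_train_timesteps
    # (a) sample only from the upper portion of the timestep range
    t = torch.randint(int(t_min_frac * T), T, (B,), device=x0.device)

    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)

    pred = unet(x_t, t, encoder_hidden_states=text_emb).sample

    # (b) constrain learning to relevant regions via an element-wise mask
    loss = (F.mse_loss(pred, noise, reduction="none") * relevance_mask).mean()
    return loss
```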
Poster
Haitam Ben Yahia · Denis Korzhenkov · Ioannis Lelekas · Amir Ghodrati · Amir Habibian

[ Exhibit Hall I ]

Abstract
Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized image-to-video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce the computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, can generate latents for a 14 x 512 x 256 px clip in 1.7 seconds on a Xiaomi-14 Pro, with negligible quality loss.
Poster
Goker Erdogan · Nikhil Parthasarathy · Catalin Ionescu · Drew Hudson · Alexander Lerchner · Andrew Zisserman · Mehdi S. M. Sajjadi · Joao Carreira

[ Exhibit Hall I ]

Abstract
We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
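Progressive layer freezing on an explicit schedule can be sketched as follows; the shallow-to-deep freezing order follows the observation above, while the `nn.ModuleList` interface and the `freeze_every` constant are illustrative assumptions rather than the paper's exact schedule.

```python
import torch.nn as nn

def apply_freezing_schedule(vit_blocks: nn.ModuleList, step: int,
                            freeze_every: int = 10_000):
    """Freeze transformer blocks shallow-to-deep on an explicit schedule.

    After every `freeze_every` optimizer steps, one more block (shallowest
    first) stops receiving gradients; deeper blocks keep training longest.
    """
    num_frozen = min(step // freeze_every, len(vit_blocks))
    for i, block in enumerate(vit_blocks):
        requires_grad = i >= num_frozen
        for p in block.parameters():
            p.requires_grad_(requires_grad)
```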
Poster
Kyle Sargent · Kyle Hsu · Justin Johnson · Li Fei-Fei · Jiajun Wu

[ Exhibit Hall I ]

Abstract
Since the advent of popular visual generation frameworks like VQGAN and Latent Diffusion Models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. In this work, we propose FlowMo, a transformer-based diffusion autoencoder. FlowMo achieves a new state-of-the-art for image tokenization at multiple bitrates. We achieve this without using convolutions, adversarial losses, spatially-aligned 2D latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. We conduct extensive analysis and ablations, and we additionally train generative models atop the FlowMo tokenizer and verify the performance. We will release our code and model checkpoints upon acceptance.
Poster
Zewei Xin · Qinya Li · Chaoyue Niu · Fan Wu · Guihai Chen

[ Exhibit Hall I ]

Abstract
Large text-to-image models demonstrate impressive generation capabilities; however, their substantial size necessitates expensive cloud servers for deployment. Conversely, light-weight models can be deployed on edge devices at lower cost but often with inferior generation quality for complex user prompts. To strike a balance between performance and cost, we propose a routing framework, called RouteT2I, which dynamically selects either the large cloud model or the light-weight edge model for each user prompt. Since generated image quality is challenging to measure and compare directly, RouteT2I establishes multi-dimensional quality metrics, in particular by evaluating the similarity between the generated images and both positive and negative texts that describe each specific quality metric. RouteT2I then predicts the expected quality of the generated images by identifying key tokens in the prompt and comparing their impact on the quality. RouteT2I further introduces the Pareto relative superiority to compare the multi-metric quality of the generated images. Based on this comparison and predefined cost constraints, RouteT2I allocates prompts to either the edge or the cloud. Evaluation reveals that RouteT2I significantly reduces the number of requests to the large cloud model while maintaining high-quality image generation.
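The routing decision rests on comparing multi-metric quality vectors. As a hedged sketch, textbook Pareto dominance over per-metric scores is shown below; the paper's "Pareto relative superiority" criterion and its cost constraints may refine this, and the quality vectors in the usage example are made up.

```python
from typing import Sequence

def pareto_superior(u: Sequence[float], v: Sequence[float]) -> bool:
    """Plain Pareto dominance over per-metric quality scores (higher is better):
    u dominates v if it is no worse on every metric and strictly better on at
    least one. Sketch only; not the paper's exact criterion."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

# Hypothetical routing rule: use the cloud model only if its predicted quality
# vector Pareto-dominates the edge model's prediction.
cloud_pred, edge_pred = [0.8, 0.7, 0.9], [0.8, 0.6, 0.7]
use_cloud = pareto_superior(cloud_pred, edge_pred)
```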
Poster
Joonghyuk Shin · Alchan Hwang · Yujin Kim · Daneul Kim · Jaesik Park

[ Exhibit Hall I ]

Abstract
Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust prompt-based image editing method for MM-DiT that supports global-to-local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
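The four-block decomposition can be illustrated on a generic joint attention over concatenated text and image tokens; this sketch uses plain single-head tensors and is not tied to any specific MM-DiT implementation.

```python
import torch

def joint_attention_blocks(q_txt, k_txt, q_img, k_img, scale=None):
    """Compute one joint attention matrix over concatenated text and image
    tokens and slice it into its four blocks (txt->txt, txt->img, img->txt,
    img->img). Shapes: (B, N, D). Illustrative only."""
    q = torch.cat([q_txt, q_img], dim=1)
    k = torch.cat([k_txt, k_img], dim=1)
    scale = scale or q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    n_t = q_txt.shape[1]
    blocks = {
        "txt2txt": attn[:, :n_t, :n_t],
        "txt2img": attn[:, :n_t, n_t:],
        "img2txt": attn[:, n_t:, :n_t],
        "img2img": attn[:, n_t:, n_t:],
    }
    return attn, blocks
```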
Poster
Woo Kyoung Han · Yongjun Lee · Byeonghun Lee · Sang Hyun Park · Sunghoon Im · Kyong Hwan Jin

[ Exhibit Hall I ]

Abstract
Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing the source coding protocol. The source code and demo files are provided in the supplementary material.
Poster
Yuxuan Zhang · Yirui Yuan · Yiren Song · Haofan Wang · Jiaming Liu

[ Exhibit Hall I ]

Abstract
Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model’s weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, …
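The KV-cache idea, encoding condition tokens once and reusing their keys/values at later attention calls, can be sketched generically as below; the class and function names are hypothetical, and the causal mask is omitted for brevity.

```python
import torch

class KVCache:
    """Minimal key/value cache: keys and values computed for condition tokens
    are stored once and reused at every later attention call instead of being
    recomputed. Generic sketch, not the paper's exact mechanism."""
    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=1)
            self.v = torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

def cached_attention(q, cache: KVCache, k_new, v_new):
    # Reuse previously cached keys/values alongside the newly computed ones.
    k, v = cache.update(k_new, v_new)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```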
Poster
Chenghu Du · Shengwu Xiong · Yi Rong

[ Exhibit Hall I ]

Abstract
Current virtual try-on methods primarily enhance performance through network optimization, like coarse-to-fine structures and referenceNet for clothing information injection. However, limited sample quantity and diversity restrict their improvement. To overcome this, we present a unified mask-free virtual try-on framework. It utilizes diffusion models' inherent properties to boost each pipeline part's ability to deeply fit the target sample distribution, thus improving performance. On the input side, our proposed text-driven pseudo-input preparation approach increases the diversity of clothing-agnostic regions in person pseudo-samples. This prompts the generator to focus more on variations in these areas and improves the model's generalization ability. At the generator, we propose gated manipulation to prevent weight forgetting and cut training costs, and introduce texture-aware injection to explicitly add human-perceptible clothing texture info. For inference, our proposed refining conditional inference approach reduces Gaussian noise randomness, thus preserving identity information and clothing details in results. Extensive experiments demonstrate our method outperforms previous virtual try-on methods.
Poster
Sixian Zhang · Xinyao Yu · Xinhang Song · Yiyao Wang · Shuqiang Jiang

[ Exhibit Hall I ]

Abstract
Object goal navigation requires an agent to navigate to a specified target in unseen environments without an explicit map, which demands an understanding of object-scene contextual relationships to infer the target's location based on partial observations. The function of an object plays a crucial role in its categorization and naming. Analyzing an object's functional role within a given scene enhances the understanding of its contextual relationships, thereby aiding in goal inference. In this paper, we propose the function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. FBN is designed to uncover the functions that observed objects afford individually or collaboratively with other objects, as well as the functional semantics contained within the observed scenes. The probabilistic directed edges in FBN describe the object-function and scene-function relationships, which are derived by prompting LLMs with the proposed CounterfactCoT. CounterfactCoT determines the existence and probability of edges by guiding LLMs to compare the impact of an edge's existence or absence on the surrounding context. Leveraging FBN with Bayesian inference, the probability of each function group and the probability map of goal occurrence are computed. The waypoint is then selected based on the obtained probability map. Experiments on MP3D and HM3D demonstrate that FBN effectively captures object-scene-function relationships and improves …
Poster
zihang zou · Boqing Gong · Liqiang Wang

[ Exhibit Hall I ]

Abstract
In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can effortlessly replicate copyrighted images, even when protected by advanced watermarking techniques. To expose the vulnerability in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that can gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Empirical experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.
Poster
Beier Zhu · Ruoyu Wang · Tong Zhao · Hanwang Zhang · Chi Zhang

[ Exhibit Hall I ]

Abstract
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of our EPD-Solver in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 5.26 on CIFAR-10, 8.74 on FFHQ, 7.95 on ImageNet, and 7.79 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin.
Poster
Yu Lei · Bingde Liu · Qingsong Xie · Haonan Lu · Zhijie Deng

[ Exhibit Hall I ]

Abstract
Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite these theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatch between the LoRA and 3D distributions in practical implementations. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to Linearized Lookahead Variational Score Distillation ($L^2$-VSD). $L^2$-VSD can be realized efficiently with forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of $L^2$-VSD, …
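Realizing a linearized model with forward-mode autodiff can be sketched with `torch.func.jvp` (PyTorch >= 2.0). This is a generic first-order linearization around the current parameters, not necessarily the paper's exact $L^2$-VSD formulation; `params` and `delta` are dicts keyed like `model.named_parameters()`.

```python
import torch
from torch.func import functional_call, jvp

def linearized_output(model, params, delta, x):
    """First-order (linearized) evaluation of `model` around `params`:
        f(params + delta)(x) ~= f(params)(x) + J_f(params) . delta,
    computed with a single forward-mode jvp. Generic sketch only."""
    def f(p):
        return functional_call(model, p, (x,))
    out, directional_derivative = jvp(f, (params,), (delta,))
    return out + directional_derivative

# Hypothetical usage: linearize a small network around its current weights.
net = torch.nn.Linear(8, 4)
params = dict(net.named_parameters())
delta = {k: 0.01 * torch.randn_like(v) for k, v in params.items()}
y_lin = linearized_output(net, params, delta, torch.randn(2, 8))
```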
Poster
Yuanhe Guo · Linxi Xie · Zhuoran Chen · Kangrui Yu · Ryan Po · Guandao Yang · Gordon Wetzstein · Hongyi Wen

[ Exhibit Hall I ]

Abstract
We propose a dataset to enable the study of generative models that understand fine-grained individual preferences.We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K different users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. Our dataset enables a set of applications. With aggregate-level user preferences from our dataset, we were able to train better preference alignment models. In addition, leveraging individual-level user preference, we benchmark the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation and highlight the space for improvements. Finally, we demonstrate that our dataset enables, for the first time, a generative model personalization paradigm by editing customized diffusion models in a latent weight space to align with individual user preferences.
Poster
Kien Nguyen · Anh Tran · Cuong Pham

[ Exhibit Hall I ]

Abstract
The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized content. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both completeness, i.e., the ability to entirely remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none can handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical for addressing copyright and legal concerns. However, erasing them from diffusion models is challenging due to their close distances to non-target neighboring concepts, requiring finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both completeness and effectiveness in erasing these narrow concepts. SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is fully erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure and compare the results with …
Poster
Alessio Spagnoletti · Jean Prost · Andres Almansa · Nicolas Papadakis · Marcelo Pereyra

[ Exhibit Hall I ]

Abstract
Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image …
Poster
Philipp Becker · Abhinav Mehrotra · Ruchika Chavhan · Malcolm Chadwick · Luca Morreale · Mehdi Noroozi · Alberto Gil Couto Pimentel Ramos · Sourav Bhattacharya

[ Exhibit Hall I ]

Abstract
Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images.However, the quadratic scaling properties of the attention in DiTs hinder image generation with higher resolution or devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs).First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated.Second, we formulate a hybrid attention scheme for multi-modal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts.Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT).We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-$\Sigma$ (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to $2.2\times$ speedup with comparable image quality after distillation.
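The hybrid scheme pairs linear attention for image-to-image interactions with standard scaled dot-product attention for prompt interactions. Below is a generic kernelized linear attention sketch; the elu+1 feature map and the (B, H, N, D) shapes are common choices assumed here, not necessarily EDiT's compressed variant.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized linear attention: phi(q) (phi(k)^T v), computed in O(N d^2)
    instead of O(N^2 d). The elu+1 feature map is an assumption here.
    Shapes: (B, H, N, D)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)            # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention, e.g. for prompt interactions.
    return F.scaled_dot_product_attention(q, k, v)
```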
Poster
Yiting Qu · Ziqing Yang · Yihan Ma · Michael Backes · Savvas Zannettou · Yang Zhang

[ Exhibit Hall I ]

Abstract
Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions, visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address such risks, we demonstrate that preprocessing transformations combining Gaussian blur and histogram equalization can substantially enhance moderation performance.
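The preprocessing defence mentioned at the end of the abstract is straightforward to sketch: blur the image so the hidden message dominates over surface texture, then equalize contrast before passing the result to a moderation classifier. The kernel size and the choice of equalizing the luminance channel below are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: Gaussian blur + histogram equalization applied before moderation.
import cv2
import numpy as np

def preprocess_for_moderation(image_bgr: np.ndarray,
                              kernel_size: int = 7) -> np.ndarray:
    # Blurring suppresses fine texture so the embedded message stands out.
    blurred = cv2.GaussianBlur(image_bgr, (kernel_size, kernel_size), 0)
    # Equalize contrast on the luminance channel to make the message more visible.
    ycrcb = cv2.cvtColor(blurred, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```
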
Poster
Junyu Chen · Dongyun Zou · Wenkun He · Junsong Chen · Enze Xie · Song Han · Han Cai

[ Exhibit Hall I ]

Abstract
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the adoption of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space, with the front latent channels capturing object structures and the later latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on the object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. We will release our pre-trained models and code upon publication.
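One way to picture the augmented diffusion training idea is an auxiliary denoising loss restricted to the front "object" channels, added on top of the usual loss over all channels. The sketch below is an assumption-laden illustration of that idea only; the channel split, loss weight, and tensor layout are hypothetical and not taken from the paper.

```python
# Hedged sketch: extra diffusion objective on the front (object) latent channels.
# Assumes latents shaped (batch, channels, height, width); all constants are
# illustrative, not the authors' settings.
import torch
import torch.nn.functional as F

def augmented_diffusion_loss(pred_noise, true_noise,
                             num_object_channels=32, aux_weight=0.5):
    # Standard diffusion objective over the full latent.
    full_loss = F.mse_loss(pred_noise, true_noise)
    # Additional objective restricted to the structured front channels,
    # which are meant to carry coarse object structure.
    obj_loss = F.mse_loss(pred_noise[:, :num_object_channels],
                          true_noise[:, :num_object_channels])
    return full_loss + aux_weight * obj_loss
```
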
Poster
Shijie Zhou · Ruiyi Zhang · Huaisheng Zhu · Branislav Kveton · Yufan Zhou · Jiuxiang Gu · Jian Chen · Changyou Chen

[ Exhibit Hall I ]

Abstract
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and makes them difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and reinforcement learning in text-to-image generation.
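The SkipCA idea can be sketched as a cross-attention block in which later-layer hidden states attend to early-layer visual features, with a residual (skip) connection back to the hidden states. The module below is a minimal sketch under assumed dimensions and head count, not the authors' architecture.

```python
# Hedged sketch of a skip-connection cross-attention block: late hidden states
# query early visual features, added back through a residual path.
import torch
import torch.nn as nn

class SkipCA(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, late_hidden: torch.Tensor, early_visual: torch.Tensor):
        # Queries from later-layer hidden states; keys/values from early-layer
        # visual features; the skip connection preserves the original states.
        attended, _ = self.attn(query=late_hidden,
                                key=early_visual,
                                value=early_visual)
        return self.norm(late_hidden + attended)
```
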
Poster
Gao Zong lin · Huu-Tai Phung · Yi-Chen Yao · Kuan-Wei Ho · Yi-Hsin Chen · Yu-Hsiang Lin · Alessandro Gnutti · Wen-Hsiao Peng

[ Exhibit Hall I ]

Abstract
This work, termed MH-LVC, presents a multi-hypothesis temporal prediction scheme that employs long- and short-term reference frames in a conditional residual video coding framework. Recent temporal context mining approaches to conditional video coding offer superior coding performance. However, the need to store and access a large amount of implicit contextual information extracted from past decoded frames in decoding a video frame poses a challenge due to excessive memory access. Our MH-LVC overcomes this issue by storing multiple long- and short-term reference frames but limiting the number of reference frames used at a time for temporal prediction to two. Our decoded frame buffer management allows the encoder to flexibly utilize the long-term key frames to mitigate temporal cascading errors and the short-term reference frames to minimize prediction errors. Moreover, our buffering scheme enables the temporal prediction structure to be adapted to individual input videos. While this flexibility is common in traditional video codecs, it has not been fully explored for learned video codecs. Extensive experiments show that the proposed method outperforms VTM-17.0 under the low-delay B configuration in terms of PSNR-RGB across commonly used test datasets, and performs comparably to the state-of-the-art learned codecs (e.g., DCVC-FM) while requiring less decoded frame buffer …
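A decoded-frame-buffer policy in the spirit described above can be sketched as a small data structure that stores long-term key frames and recent short-term frames, but exposes at most two references per frame. The selection rule below is an illustrative assumption, not the paper's exact buffer management scheme.

```python
# Hedged sketch of a two-reference buffer: one short-term frame for low
# prediction error plus one long-term key frame to limit temporal drift.
from collections import deque

class FrameBuffer:
    def __init__(self, max_short_term: int = 2):
        self.long_term = None                       # most recent key frame
        self.short_term = deque(maxlen=max_short_term)

    def add(self, frame, is_key_frame: bool = False):
        if is_key_frame:
            self.long_term = frame                  # anchor against cascading errors
        self.short_term.append(frame)

    def references(self):
        """Return at most two references for temporal prediction."""
        refs = []
        if self.short_term:
            refs.append(self.short_term[-1])
        if self.long_term is not None and (not refs or refs[0] is not self.long_term):
            refs.append(self.long_term)
        return refs[:2]
```
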
Poster
Li · Yang Xiao · Jie Ji · Kaiyuan Deng · Bo Hui · Linke Guo · Xiaolong Ma

[ Exhibit Hall I ]

Abstract
Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose **Dynamic Mask coupled with Concept-Aware Loss**, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our **Dynamic Mask** mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our **Concept-Aware Loss** explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.
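The dynamic gradient-masking idea can be pictured as zeroing out gradients outside an adaptively chosen set of weights, so that only concept-relevant parameters move during unlearning. The sketch below keeps the top fraction of gradients by magnitude; this criterion and the keep ratio are illustrative assumptions, not the paper's mask-update rule, and the concept-aware and distillation losses are omitted.

```python
# Hedged sketch: masked unlearning step that updates only the weights with the
# largest gradient magnitudes, freezing the rest to protect unrelated knowledge.
import torch

def masked_unlearning_step(model, loss, optimizer, keep_ratio=0.05):
    optimizer.zero_grad()
    loss.backward()
    for param in model.parameters():
        if param.grad is None:
            continue
        g = param.grad.abs().flatten()
        k = max(1, int(keep_ratio * g.numel()))
        threshold = torch.topk(g, k).values[-1]     # k-th largest gradient magnitude
        mask = (param.grad.abs() >= threshold).float()
        param.grad.mul_(mask)                       # zero out the masked gradients
    optimizer.step()
```
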
Poster
Luca Bartolomei · Enrico Mannocci · Fabio Tosi · Matteo Poggi · Stefano Mattoccia

[ Exhibit Hall I ]

Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires only an event stream spatially aligned with RGB frames, a simple setup that is available even off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla model such as Depth Anything v2 (DAv2) or a novel recurrent architecture derived from it, to infer depth from monocular event cameras. We evaluate our approach using synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
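The cross-modal distillation recipe reduces to a simple training loop: a frozen VFM run on the aligned RGB frame produces a dense proxy depth map, which then supervises an event-based student. The sketch below makes that concrete under assumptions; `vfm_depth`, `event_student`, and the plain L1 loss are placeholders, not the authors' models or objective.

```python
# Hedged sketch of one cross-modal distillation step: RGB -> proxy depth via a
# frozen vision foundation model, events -> predicted depth via a student model.
import torch
import torch.nn.functional as F

def distillation_step(rgb_frame, event_tensor, vfm_depth, event_student, optimizer):
    with torch.no_grad():
        proxy_depth = vfm_depth(rgb_frame)          # dense pseudo-label from RGB
    pred_depth = event_student(event_tensor)        # prediction from events only
    loss = F.l1_loss(pred_depth, proxy_depth)       # simple dense regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
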

Demonstration: Demos 4 Wed 22 Oct 02:30 p.m.  

  • Underwater NeRFs with an AUV, Selena Sun, Elsa McElhinney, Vassilis Alexopoulous, Felicie Hoffman, Lawton Scaling, Kai Song, Rohan Bhowmik, Angelina Krinos, Ryota Sato, Chisa Ogaki, Julian Shultz, Professor Oussama Khatib
  • Scalable Object Detection in Mixed Reality using Incremental Re-training and One-shot 3D Annotation, Alireza Taheritajar
  • SOAP: Style-Omniscient Animatable Portraits, Tingting Liao
  • Vision-guided Shared Autonomy for Smarter Prosthetics, Federico Vasile

PAMI TC Meeting Wed 22 Oct 04:45 p.m.  


Reception Wed 22 Oct 06:30 p.m.