


Orals
Oral
Frano Rajič · Haofei Xu · Marko Mihajlovic · Siyuan Li · Irem Demir · Emircan Gündoğdu · Lei Ke · Sergey Prokudin · Marc Pollefeys · Siyu Tang

[ Kalakaua Ballroom ]

Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or previous multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks—Panoptic Studio and DexYCB—where we achieve median trajectory errors of 3.2 cm and 2.3 cm, respectively. Notably, on DexYCB, our method surpasses the strongest single-view tracker by 58.2% and a simpler multi-view triplane-based baseline by 46.5%. It also generalizes better to diverse camera setups of 1–8 cameras with varying vantage points and video lengths of 24–150 frames. By releasing our pre-trained tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for a wide range of real-world applications.
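As a rough illustration of the k-nearest-neighbors correlation step described above, the sketch below gathers, for each tracked query point, the features of its k nearest neighbors in a fused multi-view point cloud and computes dot-product correlations that an update transformer could consume. This is a hedged toy example, not the authors' implementation; all tensor names, shapes, and the scaling choice are assumptions.

```python
import torch

def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=16):
    """Toy k-NN feature correlation for 3D point tracking (illustrative only).

    query_xyz:  (Q, 3) 3D positions of the tracked points
    query_feat: (Q, C) features of the tracked points
    cloud_xyz:  (N, 3) positions of the fused multi-view point cloud
    cloud_feat: (N, C) per-point features of the fused cloud
    Returns correlations (Q, k) and the neighbor indices (Q, k).
    """
    dists = torch.cdist(query_xyz, cloud_xyz)            # (Q, N) pairwise distances
    knn_idx = dists.topk(k, largest=False).indices       # k nearest cloud points per query
    knn_feat = cloud_feat[knn_idx]                        # (Q, k, C)
    corr = torch.einsum("qc,qkc->qk", query_feat, knn_feat)
    corr = corr / query_feat.shape[-1] ** 0.5             # scale like attention logits
    return corr, knn_idx

if __name__ == "__main__":
    Q, N, C = 8, 1024, 64
    corr, idx = knn_correlation(torch.randn(Q, 3), torch.randn(Q, C),
                                torch.randn(N, 3), torch.randn(N, C))
    print(corr.shape, idx.shape)  # torch.Size([8, 16]) torch.Size([8, 16])
```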
Oral
David G. Shatwell · Ishan Rajendrakumar Dave · Swetha Sirnam · Mubarak Shah

[ Exhibit Hall III ]

Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, we utilize Random Fourier Features for effective temporal representation. Instead of conventional contrastive learning with hard positives and negatives, we propose a metric-learning objective providing soft targets by modeling temporal differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses methods focused solely on time prediction and even those utilizing geo-location during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, while the unified embedding space facilitates compositional and text-based image retrieval.
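The cyclical time encoding mentioned above can be pictured with a small sketch: map a periodic quantity (hour or month) onto the unit circle so that wrap-around is respected, then project it with Random Fourier Features. This is a minimal illustration under assumed dimensions and sampling choices, not the GT-Loc recipe.

```python
import numpy as np

def cyclic_rff(value, period, num_features=64, seed=0):
    """Random Fourier features of a cyclic quantity (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    angle = 2.0 * np.pi * (value / period)
    z = np.array([np.cos(angle), np.sin(angle)])           # unit-circle embedding
    w = rng.normal(size=(num_features, 2))                 # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)   # random phases
    return np.sqrt(2.0 / num_features) * np.cos(w @ z + b)

# 23:00 is close to 01:00 on the clock but far from 12:00; the embedding
# distances reflect that despite the numeric wrap-around at midnight.
h23, h01, h12 = (cyclic_rff(h, period=24) for h in (23.0, 1.0, 12.0))
print(np.linalg.norm(h23 - h01), np.linalg.norm(h23 - h12))
```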
Oral
Mustafa Shukor · Enrico Fini · Victor Guilherme Turrisi da Costa · Matthieu Cord · Joshua Susskind · Alaaeldin El-Nouby

[ Exhibit Hall III ]

Abstract
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing training on multimodal data. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)—those trained from the ground up on all modalities—and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on pre-trained image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter count, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
Oral
Jonathan Ventura · Viktor Larsson · Fredrik Kahl

[ Kalakaua Ballroom ]

Abstract
Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemi-spherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion.
Oral
Simon Kiefhaber · Stefan Roth · Simone Schaub-Meyer

[ Kalakaua Ballroom ]

Abstract
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor in optical flow methods regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows the cost volume to be removed from optical flow estimators during training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at $20\mathrm{FPS}$ using only $500\mathrm{MB}$ of memory.
Oral
Shuai Tan · Bill Gong · Bin Ji · Ye Pan

[ Exhibit Hall III ]

Abstract
Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an **Enhanced Motion Indicator (EMI)** to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an **Enhanced Detail Indicator (EDI)**, which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.
Oral
Jerred Chen · Ronald Clark

[ Kalakaua Ballroom ]

Abstract
In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
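To make the "linear least squares under the small motion assumption" step concrete, the sketch below uses the classical instantaneous motion-field model (Longuet-Higgins & Prazdny): given per-pixel flow and depth, the camera's linear and angular velocity appear linearly and can be recovered with a single least-squares solve. The pixel coordinates, model form, and synthetic data are assumptions for illustration; this is not the paper's implementation.

```python
import numpy as np

def velocity_from_flow(xy, flow, depth):
    """Recover camera velocity (v, w) from flow + depth under small motion (sketch).

    For a pixel at normalized coordinates (x, y) with depth Z:
        u = (-vx + x*vz)/Z + x*y*wx - (1 + x^2)*wy + y*wz
        v = (-vy + y*vz)/Z + (1 + y^2)*wx - x*y*wy - x*wz
    Stacking all pixels gives a linear system in (vx, vy, vz, wx, wy, wz).
    """
    x, y = xy[:, 0], xy[:, 1]
    invz = 1.0 / depth
    zeros = np.zeros_like(x)
    A_u = np.stack([-invz, zeros, x * invz, x * y, -(1 + x**2), y], axis=1)
    A_v = np.stack([zeros, -invz, y * invz, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([A_u, A_v], axis=0)                 # (2N, 6)
    b = np.concatenate([flow[:, 0], flow[:, 1]], axis=0)   # (2N,)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]                                # linear, angular velocity

# Synthetic check: build flow from a known velocity and recover it exactly.
rng = np.random.default_rng(0)
xy = rng.uniform(-0.5, 0.5, size=(500, 2))
depth = rng.uniform(1.0, 5.0, size=500)
v_t, w_t = np.array([0.1, -0.05, 0.2]), np.array([0.01, 0.03, -0.02])
x, y, invz = xy[:, 0], xy[:, 1], 1.0 / depth
fu = (-v_t[0] + x * v_t[2]) * invz + x*y*w_t[0] - (1 + x**2)*w_t[1] + y*w_t[2]
fv = (-v_t[1] + y * v_t[2]) * invz + (1 + y**2)*w_t[0] - x*y*w_t[1] - x*w_t[2]
v_est, w_est = velocity_from_flow(xy, np.stack([fu, fv], axis=1), depth)
print(np.allclose(v_est, v_t), np.allclose(w_est, w_t))    # True True
```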
Oral
Derong Jin · Ruohan Gao

[ Exhibit Hall III ]

Abstract
An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
Oral
Yi Li · Hualiang Wang · Xinpeng Ding · Haonan Wang · Xiaomeng Li

[ Exhibit Hall III ]

Abstract
Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. The name also reflects that TAM excels at explaining the multiple tokens generated by an MLLM, unlike the Class Activation Map (CAM), which explains a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual …
Oral
Mark YU · Wenbo Hu · Jinbo Xing · Ying Shan

[ Kalakaua Ballroom ]

Abstract
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets via our double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method. Code and the pre-trained model will be released.
Oral
Hanwen Jiang · Hao Tan · Peng Wang · Haian Jin · Yue Zhao · Sai Bi · Kai Zhang · Fujun Luan · Kalyan Sunkavalli · Qixing Huang · Georgios Pavlakos

[ Exhibit Hall III ]

Abstract
We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to "oracle" methods that rely on pose annotations in both training and testing.
Oral
Uranik Berisha · Jens Mehnert · Alexandru Condurache

[ Kalakaua Ballroom ]

Abstract
The increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs, and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the performance of trained models after structured pruning, and thereby avoiding extensive retraining, remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are then used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy, while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44 times.
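A minimal sketch of the idea of pruning by activation variance while folding mean activations back into the network, applied to a toy two-layer MLP rather than DeiT. The calibration data, keep ratio, and layer structure are assumptions; the authors' actual procedure for transformers will differ.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def variance_prune(fc1, act, fc2, calib_x, keep_ratio=0.75):
    """One-shot variance-based pruning of a block y = fc2(act(fc1(x))) (sketch).

    1) Gather per-neuron activation statistics on calibration data.
    2) Drop the hidden neurons with the lowest activation variance.
    3) Fold their mean activation into fc2's bias to preserve the expected output.
    """
    a = act(fc1(calib_x))                                  # (B, H) hidden activations
    mean, var = a.mean(0), a.var(0)
    n_keep = int(keep_ratio * a.shape[1])
    keep = var.topk(n_keep).indices.sort().values
    mask = torch.zeros(a.shape[1], dtype=torch.bool)
    mask[keep] = True
    drop = (~mask).nonzero(as_tuple=True)[0]

    new_fc1 = nn.Linear(fc1.in_features, n_keep)
    new_fc1.weight.copy_(fc1.weight[keep]); new_fc1.bias.copy_(fc1.bias[keep])
    new_fc2 = nn.Linear(n_keep, fc2.out_features)
    new_fc2.weight.copy_(fc2.weight[:, keep])
    # Removed neurons contribute roughly their mean output, absorbed into the bias.
    new_fc2.bias.copy_(fc2.bias + fc2.weight[:, drop] @ mean[drop])
    return new_fc1, new_fc2

torch.manual_seed(0)
fc1, fc2, act = nn.Linear(32, 128), nn.Linear(128, 10), nn.ReLU()
p1, p2 = variance_prune(fc1, act, fc2, calib_x=torch.randn(512, 32))
x = torch.randn(4, 32)
print((fc2(act(fc1(x))) - p2(act(p1(x)))).abs().max())     # approximation error of pruning
```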
Oral
Alexander Mai · Peter Hedman · George Kopanas · Dor Verbin · David Futschik · Qiangeng Xu · Falko Kuester · Jonathan Barron · Yinda Zhang

[ Exhibit Hall III ]

Abstract
We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and again upon ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of ~30 FPS at 720p on an NVIDIA RTX 4090. We show that our method is more accurate on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves state-of-the-art SSIM, even higher than Zip-NeRF.
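The "constant-density primitives along a ray" idea can be illustrated in one dimension: if each primitive contributes a constant density between its entry and exit distances, the total density is piecewise constant between intersection events, so transmittance and color can be accumulated in closed form without sorting-induced popping. The sketch below is a hedged simplification (single ray, entry/exit distances assumed given by some ray-ellipsoid intersection routine elsewhere), not EVER's renderer.

```python
import numpy as np

def composite_constant_density(t_in, t_out, density, color):
    """Exact compositing of constant-density primitives along one ray (sketch).

    Primitive i occupies [t_in[i], t_out[i]] with constant density[i] and RGB
    color[i]. Between intersection events the total density is constant, so
    each span contributes an exact alpha of 1 - exp(-sigma * dt).
    """
    events = np.sort(np.unique(np.concatenate([t_in, t_out])))
    T, rgb = 1.0, np.zeros(3)
    for t0, t1 in zip(events[:-1], events[1:]):
        active = (t_in <= t0) & (t_out >= t1)      # primitives covering this span
        sigma = density[active].sum()
        if sigma <= 0.0:
            continue
        c = (density[active][:, None] * color[active]).sum(0) / sigma
        alpha = 1.0 - np.exp(-sigma * (t1 - t0))
        rgb += T * alpha * c
        T *= 1.0 - alpha
    return rgb, T                                   # color and residual transmittance

# Two overlapping primitives on a single ray
t_in, t_out = np.array([1.0, 1.5]), np.array([2.0, 2.5])
density = np.array([3.0, 1.0])
color = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(composite_constant_density(t_in, t_out, density, color))
```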
Oral
Haoyu Wu · Jingyi Xu · Hieu Le · Dimitris Samaras

[ Kalakaua Ballroom ]

Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging—those essential for semantic fidelity and structural details—significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation, with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, and PixArt-$\alpha$.
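A hedged sketch of importance-based token merging: keep the highest-scoring tokens (scores could come, e.g., from classifier-free guidance magnitudes), assign every remaining token to its most similar kept token, run the expensive computation only on the kept set, and broadcast the results back. Shapes, the cosine assignment, and the keep ratio are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def importance_merge(tokens, importance, keep_ratio=0.5):
    """Importance-aware token merging (illustrative sketch, not the paper's code)."""
    N, _ = tokens.shape
    n_keep = max(1, int(keep_ratio * N))
    keep_idx = importance.topk(n_keep).indices                 # tokens to process
    rest_mask = torch.ones(N, dtype=torch.bool)
    rest_mask[keep_idx] = False
    kept, rest = tokens[keep_idx], tokens[rest_mask]
    # Nearest kept token for each merged-away token (cosine similarity)
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)
    return kept, keep_idx, rest_mask, assign

def unmerge(processed_kept, keep_idx, rest_mask, assign, N):
    """Broadcast results from kept tokens back to the merged-away positions."""
    out = torch.empty(N, processed_kept.shape[-1], dtype=processed_kept.dtype)
    out[keep_idx] = processed_kept
    out[rest_mask] = processed_kept[assign]
    return out

tokens, importance = torch.randn(16, 8), torch.rand(16)
kept, keep_idx, rest_mask, assign = importance_merge(tokens, importance)
processed = kept * 2.0                                         # stand-in for an expensive layer
print(unmerge(processed, keep_idx, rest_mask, assign, N=16).shape)  # torch.Size([16, 8])
```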
Oral
Chen Zhao · Xuan Wang · Tong Zhang · Saqib Javed · Mathieu Salzmann

[ Exhibit Hall III ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. A $\mathbf{\Delta}$-model and a $\mathbf{\Sigma}$-model are jointly trained on the available images. The $\mathbf{\Delta}$-model is dynamically perturbed based on rendering uncertainty across training steps, generating diverse perturbed models with negligible computational overhead. Discrepancies between the $\mathbf{\Sigma}$-model and these perturbed models are minimized throughout training, forming a robust ensemble of 3DGS models. This ensemble, represented by the $\mathbf{\Sigma}$-model, is then used to generate novel-view images during inference. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions, outperforming existing state-of-the-art methods.
Oral
Yunuo Chen · Zezheng Lyu · Bing He · Ning Cao · Gang chen · Guo Lu · Wenjun Zhang

[ Kalakaua Ballroom ]

Abstract
Recent learned image compression (LIC) models have achieved remarkable rate-distortion (RD) performance, yet their high computational complexity severely limits practical deployment. To overcome this challenge, we propose a novel Stage-wise Modular Distillation framework, SMoDi, which efficiently compresses LIC models while preserving RD performance. This framework treats each stage of LIC models as an independent sub-task, mirroring the teacher model's task decomposition in the student, thereby simplifying knowledge transfer. We identify two crucial factors determining the effectiveness of knowledge distillation: student model construction and loss function design. Specifically, we first propose Teacher-Guided Student Model Construction, a pruning-like method ensuring architectural consistency between teacher and student models. Next, we introduce Implicit End-to-end Supervision, facilitating adaptive energy compaction and bitrate regularization. Based on these insights, we develop KDIC, a lightweight student model derived from the state-of-the-art S2CFormer model. Experimental results demonstrate that KDIC achieves top-tier RD performance with significantly reduced computational complexity. To our knowledge, this work is among the first successful applications of knowledge distillation to learned image compression.
Oral
Weirong Chen · Ganlin Zhang · Felix Wimbauer · Rui Wang · Nikita Araslanov · Andrea Vedaldi · Daniel Cremers

[ Exhibit Hall III ]

Abstract
Traditional SLAM systems, which rely on bundle adjustment, often struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. This work proposes a novel approach that leverages a 3D point tracker to decouple the static and dynamic motion, effectively separating the camera-induced motion from the motion of dynamic objects. Bundle adjustment can therefore operate reliably considering only the camera-induced component of the observed motion. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. By integrating motion decomposition, bundle adjustment, and depth refinement into a unified framework, our method accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
Oral
shengyuan zhang · An Zhao · Ling Yang · Zejian Li · Chenye Meng · Haoran Xu · Tianrun Chen · AnYang Wei · Perry GU · Lingyun Sun

[ Kalakaua Ballroom ]

Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed $\textbf{ScoreLiDAR}$, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel $\textbf{Structural Loss}$, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame ($>$5$\times$) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models.
Oral
Ruonan Yu · Songhua Liu · Zigeng Chen · Jingwen Ye · Xinchao Wang

[ Kalakaua Ballroom ]

Abstract
Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of the distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, aiming at effective image-to-label projectors with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments show that our method significantly reduces the storage cost to merely 0.001% compared to full soft-label storage methods while achieving comparable performance to state-of-the-art …
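The "LoRA-like" image-to-label projector can be pictured as a frozen linear head plus a trainable low-rank update, so only the small factors need to be stored with the distilled images. The sketch below uses stand-in dimensions and a randomly initialized frozen head instead of an actual CLIP projector; it illustrates the structure, not the paper's components.

```python
import torch
import torch.nn as nn

class LowRankLabelProjector(nn.Module):
    """LoRA-style image-feature -> soft-label projector (illustrative sketch).

    A frozen base projection (stand-in for a CLIP-like head) is adapted with a
    trainable low-rank update W0 + B @ A, so only the small factors train and
    need to be stored. Dimensions are assumptions.
    """
    def __init__(self, feat_dim=512, num_classes=100, rank=8):
        super().__init__()
        self.base = nn.Linear(feat_dim, num_classes, bias=False)
        self.base.weight.requires_grad_(False)            # frozen prior knowledge
        self.A = nn.Parameter(torch.randn(rank, feat_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_classes, rank))

    def forward(self, feats):
        logits = self.base(feats) + feats @ (self.B @ self.A).T
        return logits.softmax(dim=-1)                     # soft labels generated online

proj = LowRankLabelProjector()
soft_labels = proj(torch.randn(4, 512))
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(soft_labels.shape, trainable)   # (4, 100); only the low-rank factors train
```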
Oral
Yue Li · Qi Ma · Runyi Yang · Huapeng Li · Mengjiao Ma · Bin Ren · Nikola Popovic · Nicu Sebe · Ender Konukoglu · Theo Gevers · Luc Gool · Martin Oswald · Danda Pani Paudel

[ Exhibit Hall III ]

Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or on both at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. In order to power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes derived from 7 established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. …
Oral
Huiyang Hu · Peijin Wang · Hanbo Bi · Boyuan Tong · Zhaozhi Wang · Wenhui Diao · Hao Chang · Yingchao Feng · Ziqi Zhang · Yaowei Wang · Qixiang Ye · Kun Fu · Xian Sun

[ Exhibit Hall III ]

Abstract
Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these challenges, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; and 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84\% and FLOPs by 24\%, and improve throughput by 2.7 times. The code will be made publicly available.
Oral
Tianyi Wang · Shuaicheng Niu · Harry Cheng · xiao zhang · Yinglong Wang

[ Kalakaua Ballroom ]

Abstract
Suffering from performance bottlenecks in passively detecting high-quality Deepfake images due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various …
Oral
Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Korrawe Karunratanakul · Pu Wang · Hongfei Xue · Chen Chen · chuan guo · Junli Cao · Jian Ren · Sergey Tulyakov

[ Kalakaua Ballroom ]

Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution to force the generated motion to accurately align with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to combat the non-differentiable distribution sampling process encountered by the logits regularizer and logit optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77\%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at \url{https://anonymous-ai-agent.github.io/CAM}
Oral
Yi Wang · Zhitong Xiong · Chenying Liu · Adam Stewart · Thomas Dujardin · Nikolaos Ioannis Bountos · Angelos Zavras · Franziska Gerken · Ioannis Papoutsis · Laura Leal-Taixé · Xiao Xiang Zhu

[ Exhibit Hall III ]

Abstract
Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research.
Oral
Yibin Yan · Jilan Xu · Shangzhe Di · Yikun Liu · Yudi Shi · Qirui Chen · Zeqian Li · Yifei Huang · Weidi Xie

[ Exhibit Hall III ]

Abstract
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed **StreamFormer**, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
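Contribution (i) hinges on causal temporal attention: a token in frame t may attend to all tokens from frames 0..t but never to future frames. The sketch below builds such a mask for frame-ordered tokens and applies it with standard scaled dot-product attention; the token ordering and the absence of multi-head or spatial structure are simplifying assumptions, not StreamFormer's architecture.

```python
import torch
import torch.nn.functional as F

def causal_frame_mask(T, S):
    """Boolean attention mask for frame-ordered streaming video tokens (sketch).

    Tokens are laid out frame by frame (T frames, S spatial tokens each).
    A token in frame t may attend to every token in frames 0..t, never to
    future frames, which is what frame-by-frame streaming inference needs.
    """
    frame_id = torch.arange(T).repeat_interleave(S)     # (T*S,) frame index per token
    return frame_id[:, None] >= frame_id[None, :]       # True = attention allowed

T, S, C = 4, 3, 16
x = torch.randn(1, T * S, C)
mask = causal_frame_mask(T, S)
out = F.scaled_dot_product_attention(x, x, x, attn_mask=mask)
print(out.shape, mask[0, S:].any().item())              # frame-0 tokens see no future: False
```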
Oral
Byungjun Kim · Shunsuke Saito · Giljoo Nam · Tomas Simon · Jason Saragih · Hanbyul Joo · Junxuan Li

[ Kalakaua Ballroom ]

Abstract
We present a universal prior model for 3D head avatars with hair compositionality. Existing approaches for building a generalizable prior for 3D head avatars often model face and hair in a monolithic manner, where the inherent compositionality of the human head and hair is not considered. It is especially challenging for the monolithic model to self-discover the compositionality of face and hair when the dataset is not large enough. Moreover, extending the monolithic model for applications like swapping faces or hairstyles in 3D is not straightforward. Our prior model explicitly accounts for the compositionality of face and hair, learning their priors separately. To learn disentangled latent spaces of face and hair for 3D head avatars, we propose a synthetic hairless data creation pipeline for dehairing the studio-captured dataset with estimated hairless geometry and hairless texture obtained from a diffusion prior. Using a paired dataset of hair and hairless captures, disentangled prior models for face and hair can be trained by leveraging compositionality as an inductive bias to achieve disentanglement. Our model's inherent compositionality enables a seamless transfer of face and hair components between avatars while maintaining the subject's identity. Furthermore, we demonstrate that our model can be finetuned with a monocular …
Oral
Haiwen Huang · Anpei Chen · Volodymyr Havrylov · Andreas Geiger · Dan Zhang

[ Exhibit Hall III ]

Abstract
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks.
Oral
Sindhu Hegde · K R Prajwal · Taein Kwon · Andrew Zisserman

[ Kalakaua Ballroom ]

Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. All code, models, and data annotations will be released to support future research.
Oral
Ziwei Wang · Sameera Ramasinghe · Chenchen Xu · Julien Monteil · Loris Bazzani · Thalaiyasingam Ajanthan

[ Exhibit Hall III ]

Abstract
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
Oral
Junzhe Lu · Jing Lin · Hongkun Dou · Ailing Zeng · Yue Deng · Xian Liu · Zhongang Cai · Lei Yang · YULUN ZHANG · Haoqian Wang · Ziwei Liu

[ Kalakaua Ballroom ]

Abstract
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
Oral
Tingting Zheng · Hongxun Yao · Kui Jiang · Yi Xiao · Sicheng Zhao

[ Exhibit Hall III ]

Abstract
Recent advances in selective state space models (Mamba) have shown great promise in whole slide image (WSI) classification. Despite this, WSIs contain explicit local redundancy (similar patches) and irrelevant regions (uninformative instances), posing significant challenges for Mamba-based multi-instance learning (MIL) methods in capturing global representations. Furthermore, bag-level approaches struggle to extract critical features from all instances, while group-level methods fail to adequately account for tumor dispersion and intrinsic correlations across groups, leading to suboptimal global representations. To address these issues, we propose group masking Mamba (GMMamba), a novel framework that combines two elaborate modules: (1) intra-group masking Mamba (IMM) for selective instance exploration within groups, and (2) cross-group super-feature sampling (CSS) to ameliorate long-range relation learning. Specifically, IMM adaptively predicts sparse masks to filter out features with low attention scores (i.e., uninformative patterns) during bidirectional Mamba modeling, facilitating the removal of instance redundancies for compact local representation. For improved bag prediction, the CSS module further aggregates sparse group representations into discriminative features, effectively grasping comprehensive dependencies among dispersed and sparse tumor regions inherent in large-scale WSIs. Extensive experiments on four datasets demonstrate that GMMamba outperforms the state-of-the-art ACMIL by 2.2\% and 6.4\% in accuracy on the TCGA-BRCA and TCGA-ESCA datasets, …
Oral
Weixi Zheng · Jingwang Ling · Zhibo Wang · Quan Wang · Feng Xu

[ Kalakaua Ballroom ]

Abstract
We present the first method for personalized dental shape reconstruction and teeth-inclusive facial performance capture using only a single phone camera. Our approach democratizes high-quality facial avatars through a non-invasive, low-cost setup by addressing the ill-posed monocular capture problem with an analysis-by-synthesis approach. We introduce a representation adaptation technique that maintains both mesh and SDF representations of teeth, enabling efficient differentiable rendering while preventing teeth-lip interpenetration. To overcome alignment challenges with similar-appearing dental components, we leverage foundation models for semantic teeth segmentation and design specialized optimization objectives. Our method addresses the challenging occlusions of teeth during facial performance through optimization strategies that leverage facial structural priors, while our semantic mask rendering loss with optimal transport-based matching ensures convergence despite significant variations in initial positioning. Code will be released.
Oral
Zichen Liu · Yihao Meng · Hao Ouyang · Yue Yu · Bolin Zhao · Daniel Cohen-Or · Huamin Qu

[ Exhibit Hall III ]

Abstract
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The animation is represented by a canonical field that aggregates the semantic content in a canonical shape and a deformation field that applies per-frame motion to deform the canonical shape. Two fields are jointly optimized by the priors from a large pretrained text-to-video diffusion model using score-distillation loss with designed regularization, encouraging the video coherence with the intended textual concept while maintaining legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our methodology over baselines. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability.
Oral
Lennart Bastian · Mohammad Rashed · Nassir Navab · Tolga Birdal

[ Kalakaua Ballroom ]

Abstract
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings.
Oral
Ava Pun · Kangle Deng · Ruixuan Liu · Deva Ramanan · Changliu Liu · Jun-Yan Zhu

[ Exhibit Hall III ]

Abstract
We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during auto-regressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method, enabling us to generate colored and textured designs. We show that our designs can be assembled by humans manually as well as by robotic arms automatically. Upon publication, we will release our new dataset, StableText2Lego, which contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models.
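The "validity check and physics-aware rollback during autoregressive inference" follows a generic constrained-decoding pattern, sketched below with stand-in `propose` and `is_valid` callables (in LegoGPT these would wrap the language model's sampler and the physics/assembly checks). This is an illustration of the control flow only, not the paper's implementation.

```python
import random

def constrained_autoregressive_sample(propose, is_valid, max_len=20,
                                      max_attempts=8, rollback=1):
    """Validity-check-with-rollback decoding loop (generic illustration).

    `propose(seq)` samples a candidate next token and `is_valid(seq)` encodes
    hard constraints; both are user-supplied stand-ins here. If no valid
    candidate is found within `max_attempts`, the last `rollback` tokens are
    removed and decoding resumes from the shorter prefix.
    """
    seq = []
    while len(seq) < max_len:
        for _ in range(max_attempts):
            candidate = propose(seq)
            if is_valid(seq + [candidate]):
                seq.append(candidate)
                break
        else:  # every candidate violated a constraint: roll back and retry
            seq = seq[: max(0, len(seq) - rollback)]
    return seq

# Toy usage: tokens are digits; a "stable" sequence never jumps by more than 2.
random.seed(0)
propose = lambda seq: random.randint(0, 9)
is_valid = lambda seq: len(seq) < 2 or abs(seq[-1] - seq[-2]) <= 2
print(constrained_autoregressive_sample(propose, is_valid, max_len=10))
```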
Oral
Carl Olsson · Yaroslava Lochman · Johan Malmport · Christopher Zach

[ Kalakaua Ballroom ]

Abstract
Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.
Oral
Richard Liu · Daniel Fu · Noah Tan · Itai Lang · Rana Hanocka

[ Exhibit Hall III ]

Abstract
In this work we present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
Oral
Jinghao Wang · Zhang Li · Zi Wang · Banglei Guan · Yang Shang · Qifeng Yu

[ Kalakaua Ballroom ]

Abstract
Recently, 6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed decreases significantly as the number of samples increases, and 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling, providing compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
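The inductive (split) conformal calibration step can be sketched generically: score each calibration sample by its normalized keypoint error, take the conformal quantile, and use it as the radius multiplier that defines the keypoint confidence region. The Gaussian error model, scalar per-keypoint std, and synthetic data below are assumptions; the paper's exact construction and the propagation to 6D pose via the implicit function theorem are not shown.

```python
import numpy as np

def conformal_radius(pred_mean, pred_std, gt, alpha=0.1):
    """Split-conformal calibration of keypoint confidence regions (sketch).

    Nonconformity score = ||gt - mean|| / std on a held-out calibration split.
    The returned quantile q yields regions {y : ||y - mean|| <= q * std} that
    cover the ground truth with probability at least 1 - alpha (marginally).
    """
    scores = np.linalg.norm(gt - pred_mean, axis=-1) / pred_std
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))          # conformal quantile index
    return np.sort(scores)[min(rank, n) - 1]

# Synthetic calibration split: Gaussian keypoint errors with per-sample std
rng = np.random.default_rng(0)
n = 2000
std = rng.uniform(0.5, 2.0, size=n)
mean = rng.normal(size=(n, 2))
gt = mean + std[:, None] * rng.normal(size=(n, 2))
q = conformal_radius(mean, std, gt, alpha=0.1)

# Empirical coverage on fresh test samples should be close to 90%
std_t = rng.uniform(0.5, 2.0, size=n)
mean_t = rng.normal(size=(n, 2))
gt_t = mean_t + std_t[:, None] * rng.normal(size=(n, 2))
covered = np.linalg.norm(gt_t - mean_t, axis=-1) / std_t <= q
print(round(q, 3), covered.mean())
```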
Oral
Xianglong He · Zi-Xin Zou · Chia Hao Chen · Yuan-Chen Guo · Ding Liang · Chun Yuan · Wanli Ouyang · Yanpei Cao · Yangguang Li

[ Exhibit Hall III ]

Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with an ~82% reduction in Chamfer Distance and an ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation …
Oral
Yaqing Ding · Viktor Kocur · VACLAV VAVRA · Zuzana Berger Haladova · jian Yang · Torsten Sattler · Zuzana Kukelova

[ Kalakaua Ballroom ]

Abstract
Recent advances in monocular depth estimation methods (MDE) and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) calibrated cameras, (2) cameras with an unknown shared focal length, and (3) cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code will be made publicly available.
Oral
Chengtang Yao · Lidong Yu · Zhidan Liu · Jiaxi Zeng · Yuwei Wu · Yunde Jia

[ Kalakaua Ballroom ]

Abstract
The matching formulation makes it naturally hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, over-confidence in the disparity update leads to results stuck in local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the …
Oral
Jianhong Bai · Menghan Xia · Xiao Fu · Xintao Wang · Lianrui Mu · Jinwen Cao · Zuozhu Liu · Haoji Hu · Xiang Bai · Pengfei Wan · Di ZHANG

[ Exhibit Hall III ]

Abstract
Camera control has been actively studied in text- or image-conditioned video generation tasks. However, altering the camera trajectory of a given video remains under-explored, despite its importance in video creation. It is non-trivial due to the extra constraints of maintaining multi-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through an elegant yet powerful video conditioning mechanism, an aspect often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, carefully curated to follow real-world filming characteristics and covering diverse scenes and camera movements. This helps the model generalize to in-the-wild videos. Lastly, we further improve robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.
Oral
Hanyi Wang · Han Fang · Shi-Lin Wang · Ee-Chien Chang

[ Kalakaua Ballroom ]

Abstract
Generative image watermarking enables the proactive detection and traceability of generated images. Among existing methods, inversion-based frameworks achieve highly concealed watermark embedding by injecting watermarks into the latent representation before the diffusion process. The robustness of this approach hinges on both the embedding mechanism and inversion accuracy. However, prior works have predominantly focused on optimizing the embedding process while overlooking inversion errors, which significantly affect extraction fidelity. In this paper, we address the challenge of inversion errors and propose ROAR, a dual-domain optimization-based framework designed to mitigate errors arising from two key sources: 1) latent-domain errors, which accumulate across inversion steps due to inherent approximation assumptions, and 2) pixel-domain errors, which result from channel distortions such as JPEG compression. To tackle these issues, we introduce two novel components: a Regeneration-based Optimization (RO) mechanism, which incorporates an optimizable starting latent to minimize latent-domain errors, and a Mixture of Experts (MoE)-based distortion-adaptive restoration (AR) network, which effectively recovers watermarked distributions from pixel-level distortions. Extensive experiments demonstrate that ROAR significantly reduces inversion errors and enhances watermark extraction robustness, thereby improving the reliability of generative image watermarking.
Oral
Xiaohang Zhan · Dingming Liu

[ Exhibit Hall III ]

Abstract
We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects.
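The transmittance-based compositing that underlies this kind of occlusion control follows standard front-to-back volume rendering; the sketch below shows that principle on plain per-object opacity layers (the paper applies it in the diffusion model's latent space, which is omitted here).

```python
# Minimal sketch of front-to-back compositing: each object's contribution is
# weighted by the transmittance of everything occluding it, so lowering an
# object's opacity makes it translucent in the result.
import numpy as np

def composite(alphas, colors):
    """alphas: (K, H, W) opacities ordered front to back;
    colors: (K, H, W, C) per-object appearance. Returns (H, W, C)."""
    out = np.zeros(colors.shape[1:])
    transmittance = np.ones(alphas.shape[1:])          # light not yet blocked
    for a, c in zip(alphas, colors):
        out += (transmittance * a)[..., None] * c      # weight by visibility
        transmittance *= (1.0 - a)                     # remaining transmittance
    return out

# Two translucent squares: the front one at 0.5 opacity lets the back one show through.
H = W = 8
a_front = np.zeros((H, W)); a_front[2:6, 2:6] = 0.5
a_back = np.zeros((H, W)); a_back[3:7, 3:7] = 1.0
img = composite(np.stack([a_front, a_back]),
                np.stack([np.tile([1.0, 0, 0], (H, W, 1)), np.tile([0, 0, 1.0], (H, W, 1))]))
```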
Oral
Yi Chen · Yuying Ge · Weiliang Tang · Yizhuo Li · Yixiao Ge · Mingyu Ding · Ying Shan · Xihui Liu

[ Kalakaua Ballroom ]

Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output …
Oral
Jiaxu Zhang · Xianfang Zeng · Xin Chen · Wei Zuo · Gang YU · Zhigang Tu

[ Exhibit Hall III ]

Abstract
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.
Oral
Peng Du · Hui Li · Han Xu · Paul Jeon · Dongwook Lee · Daehyun Ji · Ran Yang · Feng Zhu

[ Exhibit Hall III ]

Abstract
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multi-scale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multi-scale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed that embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual decoder is elaborately designed to handle the distinct variances in the low-frequency (LF) and high-frequency (HF) sub-bands without neglecting their alignment during image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance in both perceptual quality and fidelity.
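The multi-level wavelet decomposition and a naive pyramid-style tokenization can be sketched as follows; this assumes PyWavelets and a simple fixed patch size, whereas the paper's tokenizer is part of the learned model.

```python
# Minimal sketch: decompose an image into multi-level wavelet sub-bands and
# flatten each sub-band into fixed-size patch tokens, coarse to fine.
import numpy as np
import pywt

def patchify(band, p=8):
    """Split a 2D sub-band into non-overlapping p x p patches, one token each."""
    h, w = band.shape
    h, w = h - h % p, w - w % p
    band = band[:h, :w]
    return band.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)

img = np.random.rand(256, 256).astype(np.float32)

# coeffs = [LL_n, (LH_n, HL_n, HH_n), ..., (LH_1, HL_1, HH_1)]
coeffs = pywt.wavedec2(img, wavelet="haar", level=3)

tokens = [patchify(coeffs[0])]                      # coarsest low-frequency band
for level_bands in coeffs[1:]:                      # coarse-to-fine high-frequency bands
    tokens.extend(patchify(b) for b in level_bands)
token_seq = np.concatenate(tokens, axis=0)          # (num_tokens, 64) sequence for a transformer
print(token_seq.shape)
```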
Oral
Seungju Yoo · Hyuk Kwon · Joong-Won Hwang · Kibok Lee

[ Kalakaua Ballroom ]

Abstract
Object detection is a fundamental task in computer vision that has received significant attention in recent years. Despite advances in training object detection models, evaluating their performance in real-world applications remains challenging due to the substantial costs associated with image annotation. To address this issue, we propose Prediction Consistency and Reliability (PCR) as an automated model evaluation (AutoEval) method for object detection. Our method is motivated by the observation that most existing object detection models generate many candidate predictions, which are subsequently filtered through non-maximum suppression (NMS). Specifically, we analyze 1) the consistency between the final and redundant predictions and 2) the reliability of these predictions determined by their confidence scores, and propose PCR by examining their relationships with object detection performance. Furthermore, to facilitate a more realistic assessment of AutoEval methods for object detection, we construct meta-datasets incorporating various corruptions. Experimental results demonstrate the superior performance of PCR compared to the existing AutoEval methods.
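The two ingredients named above, consistency between the final and the redundant (suppressed) predictions and reliability from confidence scores, can be sketched as follows; the exact PCR statistic in the paper may be defined differently.

```python
# Hedged sketch: 1) consistency -- how well suppressed candidate boxes agree
# with the final (NMS-kept) boxes; 2) reliability -- confidence of the kept boxes.
import torch
from torchvision.ops import nms, box_iou

def consistency_and_reliability(boxes, scores, iou_thr=0.5):
    """boxes: (N, 4) candidate boxes before NMS, scores: (N,) confidences."""
    keep = nms(boxes, scores, iou_thr)
    suppressed = torch.ones(len(boxes), dtype=torch.bool)
    suppressed[keep] = False
    if suppressed.sum() == 0:
        consistency = torch.tensor(1.0)
    else:
        # each suppressed box "votes" with its best IoU against a kept box
        consistency = box_iou(boxes[suppressed], boxes[keep]).max(dim=1).values.mean()
    reliability = scores[keep].mean()
    return consistency.item(), reliability.item()

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=torch.float)
scores = torch.tensor([0.9, 0.8, 0.6])
print(consistency_and_reliability(boxes, scores))
```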
Oral
Federico Girella · Davide Talon · Ziyue Liu · Zanxi Ruan · Yiming Wang · Marco Cristani

[ Exhibit Hall III ]

Abstract
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
Oral
Corentin Dumery · Noa Ette · Aoxiang Fan · Ren Li · Jingyi Xu · Hieu Le · Pascal Fua

[ Kalakaua Ballroom ]

Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
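The decomposition into stack geometry and occupancy ratio implies a simple counting rule, illustrated below with made-up numbers.

```python
# Back-of-the-envelope version of the decomposition in the abstract: the count
# follows from the reconstructed stack volume, the estimated occupancy ratio,
# and the volume of one (identical) object. All values here are hypothetical.
stack_volume_cm3 = 1200.0      # from multi-view 3D geometry of the stack
occupancy_ratio = 0.62         # fraction of the stack volume actually filled
object_volume_cm3 = 4.1        # known volume of one object
count = stack_volume_cm3 * occupancy_ratio / object_volume_cm3
print(round(count))            # -> estimated number of objects (~181)
```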
Oral
Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli

[ Exhibit Hall III ]

Abstract
Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX.
Oral
George Ciubotariu · Zhuyun Zhou · Zongwei Wu · Radu Timofte

[ Kalakaua Ballroom ]

Abstract
We introduce MIORe and VAR-MIORe, novel multi-task datasets that address critical limitations in current benchmarks for motion restoration tasks. Our datasets capture a broad spectrum of motion scenarios—including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects—using high-frame-rate (1000 FPS) acquisition and professional-grade optics. By averaging variable numbers of frames based on computed optical flow metrics, MIORe generates consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this framework by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark of its kind. Together, these datasets provide high-resolution, scalable ground truth that challenges existing algorithms under both controlled and adverse conditions, paving the way for next-generation research in non-uniform deblurring, video interpolation, and optical flow analysis.
Oral
Yiren Song · Danze Chen · Mike Zheng Shou

[ Exhibit Hall III ]

Abstract
Generating cognitively aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT-based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments show that LayerTracer surpasses optimization-based and neural baselines in generation quality and editability.
Oral
Ziv Weiss Haddad · Oren Barkan · Yehonatan Elisha · Noam Koenigstein

[ Kalakaua Ballroom ]

Abstract
Completeness is a widely discussed property in explainability research, requiring that the attributions sum to the model’s response to the input. While completeness intuitively suggests that the model’s prediction is "completely explained" by the attributions, its global formulation alone is insufficient to ensure meaningful explanations. We contend that promoting completeness locally within attribution subregions, in a soft manner, can serve as a standalone guiding principle for producing high quality attributions. To this end, we introduce the concept of the completeness gap as a flexible measure of completeness and propose an optimization procedure that minimizes this gap across subregions within the attribution map. Extensive evaluations across various model architectures demonstrate that our method outperforms state-of-the-art explanation methods on multiple benchmarks.
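One plausible formalization of the global and subregion completeness gaps is given below; the notation is assumed and the paper's exact definitions may differ.

```latex
% Hedged formalization (notation assumed). Global completeness for an attribution
% map A of model f at input x with respect to a baseline x_0:
\sum_i A_i \;=\; f(x) - f(x_0)
% A corresponding global gap, and a local version on a subregion R:
\Delta(A) \;=\; \Bigl|\textstyle\sum_i A_i - \bigl(f(x) - f(x_0)\bigr)\Bigr|,
\qquad
\Delta_R(A) \;=\; \Bigl|\textstyle\sum_{i \in R} A_i - \bigl(f(x) - f(x_{\setminus R})\bigr)\Bigr|,
% where x_{\setminus R} replaces the pixels of R by the baseline; the optimization
% then softly minimizes an aggregate of \Delta_R over subregions R of the map.
```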
Oral
Dengke Zhang · Fagui Liu · Quan Tang

[ Kalakaua Ballroom ]

Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. We further introduce two additional branches to strengthen patch features’ spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. With these improvements to patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks.
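The scope-and-value reconstruction of patch correlations can be sketched as follows; the segment-restricted, similarity-weighted aggregation below is an assumed form (generic tensors, a single temperature), not the released CorrCLIP code.

```python
# Minimal sketch: restrict patch correlations to SAM segments (scope) and
# weight them by similarities from a self-supervised feature extractor (value).
import torch

def scoped_correlations(clip_feats, ssl_feats, segment_ids, tau=0.1):
    """clip_feats: (N, C) CLIP patch features, ssl_feats: (N, D) self-supervised
    patch features, segment_ids: (N,) SAM segment id per patch."""
    same_segment = segment_ids[:, None] == segment_ids[None, :]   # scope of interactions
    sim = torch.nn.functional.normalize(ssl_feats, dim=-1)
    sim = sim @ sim.T / tau                                       # coherent similarity values
    sim = sim.masked_fill(~same_segment, float("-inf"))           # drop inter-segment links
    weights = sim.softmax(dim=-1)
    return weights @ clip_feats                                   # re-aggregated patch features

# Toy usage with six patches split into three SAM segments.
out = scoped_correlations(torch.randn(6, 4), torch.randn(6, 8), torch.tensor([0, 0, 1, 1, 2, 2]))
```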
Oral
Elisabetta Fedele · Boyang Sun · Francis Engelmann · Marc Pollefeys · Leonidas Guibas

[ Exhibit Hall III ]

Abstract
We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
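For background, the superquadric primitive itself is defined by a simple inside-outside function; the sketch below shows that standard function only, not the decomposition network.

```python
# Standard superquadric inside-outside function: F < 1 inside, F = 1 on the
# surface, F > 1 outside. a1, a2, a3 are axis scales; eps1, eps2 control the
# shape (spheres, boxes, cylinders, octahedra, ... as the exponents vary).
import numpy as np

def superquadric_F(p, a=(1.0, 1.0, 1.0), eps1=0.5, eps2=0.5):
    """p: (N, 3) points expressed in the superquadric's local frame."""
    x, y, z = np.abs(p[:, 0]) / a[0], np.abs(p[:, 1]) / a[1], np.abs(p[:, 2]) / a[2]
    return (x ** (2 / eps2) + y ** (2 / eps2)) ** (eps2 / eps1) + z ** (2 / eps1)

pts = np.random.uniform(-1.5, 1.5, size=(5, 3))
print(superquadric_F(pts) <= 1.0)   # which sampled points fall inside the primitive
```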
Oral
WEIMING ZHANG · Dingwen Xiao · Lei Chen · Lin Wang

[ Kalakaua Ballroom ]

Abstract
Entity Segmentation (ES) aims to identify and segment distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications that must adapt to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or incur high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG), which hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks; that is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks that are fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized …
Oral
Hamadi Chihaoui · Paolo Favaro

[ Exhibit Hall III ]

Abstract
Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge of the corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, deblurring, denoising, and super-resolution, achieving state-of-the-art results.
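The early-stopping idea can be illustrated with a classic DIP-style loop; this sketch fits a small randomly initialized CNN to the degraded observation and keeps an early snapshot, whereas DIIP replaces the random network with a pretrained diffusion prior (omitted here), so treat it only as a schematic.

```python
# Schematic of the early-stopping idea only: clean image content is typically
# fit first, artifacts are re-fit later, so an early snapshot is returned.
import torch
import torch.nn as nn

degraded = torch.rand(1, 3, 64, 64)                      # observed degraded image (placeholder)
net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1))
z = torch.rand_like(degraded)                            # fixed random input
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

snapshots = []
for step in range(1, 1201):
    opt.zero_grad()
    out = net(z)
    loss = ((out - degraded) ** 2).mean()                # fit the degraded observation directly
    loss.backward()
    opt.step()
    if step % 100 == 0:
        snapshots.append(out.detach())                   # keep intermediate reconstructions
restored = snapshots[2]                                  # early-stopped estimate (heuristic choice)
```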
Oral

[ Exhibit Hall III ]

Abstract
A lens brings a single plane into focus on a planar sensor; hence, parts of the scene that are outside this focal plane are resolved on the sensor under defocus. Can we break this precept by enabling a lens that can change its depth of field arbitrarily? This work investigates the design and implementation of such a computational lens with spatially-selective focusing. Our design uses an optical arrangement of Lohmann lenses and phase spatial light modulators to allow each pixel to focus onto a different depth. We extend classical techniques used in autofocusing to the spatially-varying scenario, where the depth map is iteratively estimated using contrast and disparity cues, enabling the camera to progressively shape its depth of field to the scene's depth. By obtaining an optical all-in-focus image, our technique advances upon a broad swathe of prior work, ranging from depth-from-focus/defocus to coded aperture techniques, in two key aspects: the ability to bring an entire scene into focus simultaneously, and the ability to maintain the highest possible spatial resolution.
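The contrast cue used for depth estimation here follows classic depth-from-focus practice; a minimal sketch is shown below (assumed focus measure: local Laplacian energy over a focal stack, which is not the paper's full contrast-plus-disparity scheme).

```python
# Minimal sketch: given a focal stack, pick per pixel the focus setting with
# the highest local Laplacian energy (a standard contrast-based focus measure).
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(stack):
    """stack: (F, H, W) grayscale focal stack -> (H, W) index of the sharpest slice."""
    sharpness = np.stack([uniform_filter(laplace(s) ** 2, size=9) for s in stack])
    return sharpness.argmax(axis=0)

stack = np.random.rand(5, 120, 160)     # placeholder focal stack (5 focus settings)
depth_index = depth_from_focus(stack)   # (120, 160) map of best-focus indices
```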
Oral
Yiqing Shen · Bohan Liu · Chenjia Li · Lalithkumar Seenivasan · Mathias Unberath

[ Kalakaua Ballroom ]

Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and …
Oral
Andrea Simonelli · Norman Müller · Peter Kontschieder

[ Kalakaua Ballroom ]

Abstract
The increasing availability of digital 3D environments, whether through image reconstruction, generation, or scans obtained via lasers or robots, is driving innovation across various fields. Among the numerous applications, there is a significant demand for those that enable 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and consistently perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as Gaussian Splatting.
Oral

[ Exhibit Hall III ]

Abstract
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher-resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10× data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10× increase in training data. Finally, we estimate a total data requirement of ≈100M samples (3000 …
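One plausible reading of the reported data-scaling behavior, stated as a formula for illustration only (not a fitted law from the paper), is that each tenfold increase in training samples N improves the evaluation metric by roughly 20 points relative to a reference dataset size N_0:

```latex
\mathrm{perf}(N) \;\approx\; \mathrm{perf}(N_0) \;+\; 0.20\,\log_{10}\!\frac{N}{N_0}
```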
Oral
Binbin Xiang · Maciej Wielgosz · Stefano Puliti · Kamil Král · Martin Krůček · Azim Missarov · Rasmus Astrup

[ Kalakaua Ballroom ]

Abstract
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released post-acceptance.
Oral
Xinyu Zhou · Peiqi Duan · Yeliduosi Xiaokaiti · Chao Xu · Boxin Shi

[ Exhibit Hall III ]

Abstract
Visual vibrometry has emerged as a powerful technique for the remote acquisition of audio signals and the physical properties of materials. To capture high-frequency vibrations, frame-based visual vibrometry approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. Exploiting the high temporal resolution, high dynamic range, and low bandwidth of event cameras, event-based visual vibrometry achieves high-speed vibration sensing under common lighting conditions with enhanced data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach for estimating coarse motion within short time intervals and a neural network to mitigate the inaccuracies of the coarse motion estimation. We demonstrate our method by capturing vibrations caused by audio sources and estimating material properties for various objects.