Oral
Oral 5B: Applications and evaluation
Kalakaua Ballroom
ROAR: Reducing Inversion Error in Generative Image Watermarking
Hanyi Wang · Han Fang · Shi-Lin Wang · Ee-Chien Chang
Generative image watermarking enables the proactive detection and traceability of generated images. Among existing methods, inversion-based frameworks achieve highly concealed watermark embedding by injecting watermarks into the latent representation before the diffusion process. The robustness of this approach hinges on both the embedding mechanism and inversion accuracy. However, prior works have predominantly focused on optimizing the embedding process while overlooking inversion errors, which significantly affect extraction fidelity. In this paper, we address the challenge of inversion errors and propose ROAR, a dual-domain optimization-based framework designed to mitigate errors arising from two key sources: 1) latent-domain errors, which accumulate across inversion steps due to inherent approximation assumptions, and 2) pixel-domain errors, which result from channel distortions such as JPEG compression. To tackle these issues, we introduce two novel components: a Regeneration-based Optimization (RO) mechanism, which incorporates an optimizable starting latent to minimize latent-domain errors, and a Mixture-of-Experts (MoE)-based distortion-adaptive restoration (AR) network, which effectively recovers watermarked distributions from pixel-level distortions. Extensive experiments demonstrate that ROAR significantly reduces inversion errors and enhances watermark extraction robustness, thereby improving the reliability of generative image watermarking.
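The regeneration-based idea can be illustrated with a minimal PyTorch sketch, not the paper's implementation: the starting latent is treated as a free variable and optimized so that inverting the regenerated image reproduces the watermarked latent. Here `generate` and `invert` are assumed stand-ins for a differentiable diffusion sampler and its approximate inversion.

```python
import torch

def optimize_starting_latent(z_watermarked, generate, invert,
                             steps=100, lr=1e-2):
    """Illustrative regeneration-style optimization of the starting latent.

    z_watermarked: target watermarked latent (tensor)
    generate:      assumed differentiable mapping latent -> image
    invert:        assumed approximate inversion mapping image -> latent
    """
    # Initialize the optimizable latent from the watermarked latent.
    z = z_watermarked.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        image = generate(z)        # regenerate an image from the latent
        z_rec = invert(image)      # map it back to the latent domain
        # Penalize the latent-domain inversion error.
        loss = torch.nn.functional.mse_loss(z_rec, z_watermarked)
        loss.backward()
        opt.step()

    return z.detach()
```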
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen · Yuying Ge · Weiliang Tang · Yizhuo Li · Yixiao Ge · Mingyu Ding · Ying Shan · Xihui Liu
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
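As a rough illustration of motion-token autoregression (under assumed interfaces, not Moto's actual architecture), the sketch below trains a small causal Transformer to predict the next discrete motion token; the model name and the tokenized inputs are hypothetical, and a learned motion tokenizer is assumed to have already produced the integer token ids.

```python
import torch
import torch.nn as nn

class TinyMotionGPT(nn.Module):
    """A small causal Transformer over discrete motion token ids (sketch)."""
    def __init__(self, vocab_size=1024, dim=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T) int ids
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)                          # next-token logits

def autoregressive_loss(model, tokens):
    # Standard next-token objective: predict token t+1 from tokens <= t.
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```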
Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Seungju Yoo · Hyuk Kwon · Joong-Won Hwang · Kibok Lee
Object detection is a fundamental task in computer vision that has received significant attention in recent years. Despite advances in training object detection models, evaluating their performance in real-world applications remains challenging due to the substantial costs associated with image annotation. To address this issue, we propose Prediction Consistency and Reliability (PCR) as an automated model evaluation (AutoEval) method for object detection. Our method is motivated by the observation that most existing object detection models generate many candidate predictions, which are subsequently filtered through non-maximum suppression (NMS). Specifically, we analyze 1) the consistency between the final and redundant predictions and 2) the reliability of these predictions determined by their confidence scores, and propose PCR by examining their relationships with object detection performance. Furthermore, to facilitate a more realistic assessment of AutoEval methods for object detection, we construct meta-datasets incorporating various corruptions. Experimental results demonstrate the superior performance of PCR compared to the existing AutoEval methods.
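A hedged sketch of how a consistency/reliability-style signal could be computed from a detector's raw outputs is shown below; the helper names, IoU threshold, and weighting are illustrative assumptions, and the paper's exact PCR formulation may differ.

```python
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4), both in (x1, y1, x2, y2) format.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def consistency_reliability_score(final_boxes, final_scores,
                                  raw_boxes, raw_scores, iou_thr=0.5):
    """Consistency: agreement of redundant (pre-NMS) boxes with each kept box.
    Reliability: the confidence of those agreeing boxes."""
    per_detection = []
    for box, conf in zip(final_boxes, final_scores):
        overlaps = iou(box, raw_boxes)
        support = overlaps > iou_thr       # redundant boxes backing this detection
        if support.any():
            consistency = overlaps[support].mean()
            reliability = raw_scores[support].mean()
            per_detection.append(conf * consistency * reliability)
    return float(np.mean(per_detection)) if per_detection else 0.0
```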
Counting Stacked Objects
Corentin Dumery · Noa Ette · Aoxiang Fan · Ren Li · Jingyi Xu · Hieu Le · Pascal Fua
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
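Once the two quantities are estimated, the counting step itself reduces to simple arithmetic; the sketch below assumes the reconstructed geometry has already been reduced to a stack volume and that a per-object volume is known, which is a simplification for illustration only.

```python
def estimate_count(stack_volume_cm3: float,
                   occupancy_ratio: float,
                   object_volume_cm3: float) -> int:
    """Count ~ (occupied volume) / (volume of one object)."""
    occupied = stack_volume_cm3 * occupancy_ratio
    return round(occupied / object_volume_cm3)

# Example: a 10,000 cm^3 container filled at ~62% with 50 cm^3 objects
# gives roughly 124 objects.
print(estimate_count(10_000, 0.62, 50.0))
```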
MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
George Ciubotariu · Zhuyun Zhou · Zongwei Wu · Radu Timofte
We introduce MIORe and VAR-MIORe, novel multi-task datasets that address critical limitations in current benchmarks for motion restoration tasks. Our datasets capture a broad spectrum of motion scenarios—including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects—using high-frame-rate (1000 FPS) acquisition and professional-grade optics. By averaging variable numbers of frames based on computed optical flow metrics, MIORe generates consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this framework by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark of its kind. Together, these datasets provide high-resolution, scalable ground truth that challenges existing algorithms under both controlled and adverse conditions, paving the way for next-generation research in non-uniform deblurring, video interpolation, and optical flow analysis.
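The flow-adaptive averaging can be sketched as follows; the target-blur heuristic, frame limits, and inputs are assumptions made for illustration rather than the datasets' exact generation protocol.

```python
import numpy as np

def synthesize_blur(frames, flow_magnitudes, target_blur=8.0,
                    min_frames=3, max_frames=64):
    """frames: list of HxWx3 float arrays from a high-frame-rate capture.
    flow_magnitudes: per-frame mean optical-flow magnitude (pixels/frame)."""
    mean_flow = float(np.mean(flow_magnitudes)) + 1e-6
    # Slow motion needs more averaged frames to reach the same blur extent.
    n = int(np.clip(round(target_blur / mean_flow), min_frames, max_frames))
    window = np.stack(frames[:n], axis=0)
    blurred = window.mean(axis=0)      # averaged, motion-blurred frame
    sharp = frames[n // 2]             # a sharp frame kept as ground truth
    return blurred, sharp
```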
Soft Local Completeness: Rethinking Completeness in XAI
Ziv Weiss Haddad · Oren Barkan · Yehonatan Elisha · Noam Koenigstein
Completeness is a widely discussed property in explainability research, requiring that the attributions sum to the model’s response to the input. While completeness intuitively suggests that the model’s prediction is "completely explained" by the attributions, its global formulation alone is insufficient to ensure meaningful explanations. We contend that promoting completeness locally within attribution subregions, in a soft manner, can serve as a standalone guiding principle for producing high-quality attributions. To this end, we introduce the concept of the completeness gap as a flexible measure of completeness and propose an optimization procedure that minimizes this gap across subregions within the attribution map. Extensive evaluations across various model architectures demonstrate that our method outperforms state-of-the-art explanation methods on multiple benchmarks.
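One plausible reading of the completeness gap, written as an illustrative sketch rather than the paper's definition: for each subregion, compare the summed attributions inside it with the change in the model's response when that subregion is masked out, then average the discrepancies.

```python
import numpy as np

def completeness_gap(model, x, attributions, regions, baseline=0.0):
    """model: callable mapping an input array to a scalar response.
    x, attributions: arrays of the same shape.
    regions: list of boolean masks, each selecting one subregion."""
    full_response = model(x)
    gaps = []
    for mask in regions:
        x_masked = np.where(mask, baseline, x)    # remove the subregion
        effect = full_response - model(x_masked)  # response attributable to it
        gaps.append(abs(attributions[mask].sum() - effect))
    return float(np.mean(gaps))                   # average local gap
```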